perl-unicode

Re: Change 15689: What started as a small nit (the charnames test, nit found

2002-04-05 12:54:19
On Tue, 2 Apr 2002 13:45:06 -0800, jhi(_at_)iki(_dot_)fi (Jarkko Hietaniemi) 
wrote:

Change 15689 by jhi(_at_)alpha on 2002/04/02 20:35:13

      What started as a small nit (the charnames test, nit found
      by Hugo), ballooned a bit... the goal is Larry's wish that
      illegal Unicode (such as U+FFFF) by default doesn't warn,
      since what if somebody WANTS to create illegal Unicode?
      Now getting close to this in the regex runtime.
      (Also, fix more of my fixation that BOM would be U+FFFE.)

Um, U+FFFE is just as illegal as U+FFFF, as I understand it (and
similarly with U+10FFFE etc.). Why did you remove it from the tests?

By my understanding, it would make more sense to keep it in, together
with U+FFFF, since it was already there. Even if it isn't your beloved
BOM ;)
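
To make the point concrete, here is a minimal sketch (not code from the
patch) of a predicate covering U+FFFE and U+FFFF together with their
counterparts in the higher planes. The names sketch_is_plane_end_nonchar
and SKETCH_UNICODE_MAX are made up for illustration; only the value
0x10FFFF is taken from PERL_UNICODE_MAX in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest Unicode code point; same value as PERL_UNICODE_MAX in the patch. */
#define SKETCH_UNICODE_MAX 0x10FFFFUL

/* Hypothetical helper: true for U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...,
 * U+10FFFE, U+10FFFF, i.e. the last two code points of each plane.
 * Masking with 0xFFFE ignores the lowest bit, so yyFFFE and yyFFFF
 * are matched by the same comparison. */
static bool sketch_is_plane_end_nonchar(uint32_t uv)
{
    return uv <= SKETCH_UNICODE_MAX && (uv & 0xFFFEUL) == 0xFFFEUL;
}
```

With this, 0xFFFE, 0xFFFF and 0x10FFFE all test true, while 0xFFFD and
0xFEFF (the real BOM) test false.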

See also my comments interspersed in the patch below.

==== //depot/perl/t/lib/warnings/utf8#7 (text) ====
Index: perl/t/lib/warnings/utf8
--- perl/t/lib/warnings/utf8.~1~      Tue Apr  2 13:45:05 2002
+++ perl/t/lib/warnings/utf8  Tue Apr  2 13:45:05 2002
@@ -39,7 +39,6 @@
 my $dfff  = chr(0xDFFF);
 my $e000  = chr(0xE000);
 my $fffd  = chr(0xFFFD);
-my $fffe  = chr(0xFFFE);
 my $ffff  = chr(0xFFFF);
 my $hex4  = chr(0x10000);
 my $hex5  = chr(0x100000);
@@ -50,7 +49,6 @@
 my $dfff  = chr(0xDFFF);
 my $e000  = chr(0xE000);
 my $fffd  = chr(0xFFFD);
-my $fffe  = chr(0xFFFE);
 my $ffff  = chr(0xFFFF);
 my $hex4  = chr(0x10000);
 my $hex5  = chr(0x100000);
@@ -58,9 +56,8 @@
 EXPECT
 UTF-16 surrogate 0xd800 at - line 3.
 UTF-16 surrogate 0xdfff at - line 4.
-Unicode character 0xfffe is illegal at - line 7.
-Unicode character 0xffff is illegal at - line 8.
-Unicode character 0x10ffff is illegal at - line 11.
+Unicode character 0xffff is illegal at - line 7.
+Unicode character 0x10ffff is illegal at - line 10.
 ########
 use warnings 'utf8';
 my $d7ff  = pack("U", 0xD7FF);

Perhaps add a test for 0x10fffe as well?

@@ -68,7 +65,6 @@
 my $dfff  = pack("U", 0xDFFF);
 my $e000  = pack("U", 0xE000);
 my $fffd  = pack("U", 0xFFFD);
-my $fffe  = pack("U", 0xFFFE);
 my $ffff  = pack("U", 0xFFFF);
 my $hex4  = pack("U", 0x10000);
 my $hex5  = pack("U", 0x100000);
@@ -79,7 +75,6 @@
 my $dfff  = pack("U", 0xDFFF);
 my $e000  = pack("U", 0xE000);
 my $fffd  = pack("U", 0xFFFD);
-my $fffe  = pack("U", 0xFFFE);
 my $ffff  = pack("U", 0xFFFF);
 my $hex4  = pack("U", 0x10000);
 my $hex5  = pack("U", 0x100000);
@@ -87,9 +82,8 @@
 EXPECT
 UTF-16 surrogate 0xd800 at - line 3.
 UTF-16 surrogate 0xdfff at - line 4.
-Unicode character 0xfffe is illegal at - line 7.
-Unicode character 0xffff is illegal at - line 8.
-Unicode character 0x10ffff is illegal at - line 11.
+Unicode character 0xffff is illegal at - line 7.
+Unicode character 0x10ffff is illegal at - line 10.
 ########
 use warnings 'utf8';
 my $d7ff  = "\x{D7FF}";
@@ -97,7 +91,6 @@
 my $dfff  = "\x{DFFF}";
 my $e000  = "\x{E000}";
 my $fffd  = "\x{FFFD}";
-my $fffe  = "\x{FFFE}";
 my $ffff  = "\x{FFFF}";
 my $hex4  = "\x{10000}";
 my $hex5  = "\x{100000}";
@@ -108,7 +101,6 @@
 my $dfff  = "\x{DFFF}";
 my $e000  = "\x{E000}";
 my $fffd  = "\x{FFFD}";
-my $fffe  = "\x{FFFE}";
 my $ffff  = "\x{FFFF}";
 my $hex4  = "\x{10000}";
 my $hex5  = "\x{100000}";
@@ -116,6 +108,5 @@
 EXPECT
 UTF-16 surrogate 0xd800 at - line 3.
 UTF-16 surrogate 0xdfff at - line 4.
-Unicode character 0xfffe is illegal at - line 7.
-Unicode character 0xffff is illegal at - line 8.
-Unicode character 0x10ffff is illegal at - line 11.
+Unicode character 0xffff is illegal at - line 7.
+Unicode character 0x10ffff is illegal at - line 10.

==== //depot/perl/utf8.c#184 (text) ====
Index: perl/utf8.c
--- perl/utf8.c.~1~   Tue Apr  2 13:45:05 2002
+++ perl/utf8.c       Tue Apr  2 13:45:05 2002
@@ -64,13 +64,13 @@
                ((uv >= 0xFDD0 && uv <= 0xFDEF &&
                  !(flags & UNICODE_ALLOW_FDD0))
                 ||
-                ((uv & 0xFFFF) == 0xFFFE &&
-                 !(flags & UNICODE_ALLOW_FFFE))
+                (UNICODE_IS_BYTE_ORDER_MARK(uv) &&
+                 !(flags & UNICODE_ALLOW_BOM))

BOM should always be allowed, as I understand it -- even if it's not at
the beginning, it's ZWNBSP (at least in Unicode 3.0) and perfectly fine
anywhere in a string. I'm not sure why it needs to be specifically
allowed.
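
For reference, a small sketch of how U+FEFF and U+FFFE relate, which is
probably where the confusion between the two comes from: U+FFFE is what
a BOM looks like when a UTF-16 code unit is read with the wrong byte
order. The helper names below are hypothetical and nothing here is from
utf8.c.

```c
#include <stdint.h>

/* U+FEFF, same value as UNICODE_BYTE_ORDER_MARK after the patch. */
#define SKETCH_BOM 0xFEFFu

/* Swap the two bytes of a 16-bit code unit, as happens when UTF-16
 * text is read with the opposite endianness. */
static uint16_t sketch_swap16(uint16_t unit)
{
    return (uint16_t)((unit << 8) | (unit >> 8));
}

/* Hypothetical helper: classify the first UTF-16 code unit of a stream.
 * Returns 0 if it is a BOM in native byte order, 1 if it is a
 * byte-swapped BOM (it reads as 0xFFFE), and -1 if it is not a BOM. */
static int sketch_bom_byte_order(uint16_t first_unit)
{
    if (first_unit == SKETCH_BOM)
        return 0;
    if (sketch_swap16(first_unit) == SKETCH_BOM)
        return 1;
    return -1;
}
```

So 0xFEFF itself is an ordinary character, and only its byte-swapped
image 0xFFFE is the illegal one.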

On the other hand, it might make sense to have one flag that allows both
0xyyFFFE and 0xyyFFFF simultaneously (for yy = 0 .. 0x10) -- perhaps
modify the UNICODE_ALLOW_FFFF flag, or simply extend it, to allow both
yyFFFF and yyFFFE.
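
A rough sketch of what such a combined flag could look like, assuming a
hypothetical SKETCH_ALLOW_NONCHAR flag and a hypothetical helper
sketch_is_acceptable (neither exists in utf8.h or utf8.c; this only
illustrates the suggestion):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag covering both yyFFFE and yyFFFF; the bit value is
 * borrowed from UNICODE_ALLOW_FFFF purely for illustration. */
#define SKETCH_ALLOW_NONCHAR 0x0008u

/* Hypothetical check: a code point whose low 16 bits are FFFE or FFFF
 * is rejected unless the combined flag is set; everything else passes.
 * Limits above 0x10FFFF are deliberately left to a separate check,
 * as with UNICODE_ALLOW_SUPER in the patch. */
static bool sketch_is_acceptable(uint32_t uv, unsigned flags)
{
    bool plane_end = (uv & 0xFFFFUL) >= 0xFFFEUL;

    return !plane_end || (flags & SKETCH_ALLOW_NONCHAR) != 0;
}
```

That way a caller either gets both kinds of illegal character or
neither, and U+FEFF needs no flag at all.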

                 ||
                 ((uv & 0xFFFF) == 0xFFFF &&
                  !(flags & UNICODE_ALLOW_FFFF))) &&
                /* UNICODE_ALLOW_SUPER includes
-                * FFFEs and FFFFs beyond 0x10FFFF. */
+                * FFFFs beyond 0x10FFFF. */

The mention of FFFEs in this comment should probably stay.

                ((uv <= PERL_UNICODE_MAX) ||
                 !(flags & UNICODE_ALLOW_SUPER))
                )
[snip]
==== //depot/perl/utf8.h#57 (text) ====
Index: perl/utf8.h
--- perl/utf8.h.~1~   Tue Apr  2 13:45:05 2002
+++ perl/utf8.h       Tue Apr  2 13:45:05 2002
@@ -188,24 +188,24 @@
 #define UNICODE_SURROGATE_FIRST              0xd800
 #define UNICODE_SURROGATE_LAST               0xdfff
 #define UNICODE_REPLACEMENT          0xfffd
-#define UNICODE_BYTER_ORDER_MARK     0xfffe
+#define UNICODE_BYTE_ORDER_MARK              0xfeff
 #define UNICODE_ILLEGAL                      0xffff
 
 /* Though our UTF-8 encoding can go beyond this,
- * let's be conservative. */
+ * let's be conservative and do as Unicode 3.2 says. */
 #define PERL_UNICODE_MAX     0x10FFFF
 
 #define UNICODE_ALLOW_SURROGATE 0x0001       /* Allow UTF-16 surrogates (EVIL) */
 #define UNICODE_ALLOW_FDD0   0x0002  /* Allow the U+FDD0...U+FDEF */
-#define UNICODE_ALLOW_FFFE   0x0004  /* Allow 0xFFFE, 0x1FFFE, ... */
-#define UNICODE_ALLOW_FFFF   0x0008  /* Allow 0xFFFE, 0x1FFFE, ... */
+#define UNICODE_ALLOW_BOM    0x0004  /* Allow 0xFEFF */
+#define UNICODE_ALLOW_FFFF   0x0008  /* Allow 0xFFFF, 0x1FFFF, ... */
 #define UNICODE_ALLOW_SUPER  0x0010  /* Allow past 10xFFFF */
 #define UNICODE_ALLOW_ANY    0xFFFF

There shouldn't be any need for _ALLOW_BOM here, but perhaps extend
_ALLOW_FFFF? Or just revert the change and keep _ALLOW_FFFE parallel to
_ALLOW_FFFF (they're both illegal characters). I would have thought that
one would usually want to allow both or neither, not just one of them.

 #define UNICODE_IS_SURROGATE(c)              ((c) >= UNICODE_SURROGATE_FIRST && \
                                       (c) <= UNICODE_SURROGATE_LAST)
 #define UNICODE_IS_REPLACEMENT(c)    ((c) == UNICODE_REPLACEMENT)
-#define UNICODE_IS_BYTE_ORDER_MARK(c)        ((c) == UNICODE_BYTER_ORDER_MARK)
+#define UNICODE_IS_BYTE_ORDER_MARK(c)        ((c) == UNICODE_BYTE_ORDER_MARK)
 #define UNICODE_IS_ILLEGAL(c)                ((c) == UNICODE_ILLEGAL)
 
 #ifdef HAS_QUAD
End of Patch.

Cheers,
Philip
