On Tue, 2 Apr 2002 13:45:06 -0800, jhi(_at_)iki(_dot_)fi (Jarkko Hietaniemi)
wrote:
Change 15689 by jhi(_at_)alpha on 2002/04/02 20:35:13
What started as a small nit (the charnames test, nit found
by Hugo), ballooned a bit... the goal is Larry's wish that
illegal Unicode (such as U+FFFF) by default doesn't warn,
since what if somebody WANTS to create illegal Unicode?
Now getting close to this in the regex runtime.
(Also, fix more of my fixation that BOM would be U+FFFE.)
Um, U+FFFE is just as illegal as U+FFFF, as I understand it (and
similarly with U+10FFFE etc.). Why did you remove it from the tests?
By my understanding, it would make more sense to keep it in, together
with U+FFFF, since it was already there. Even if it isn't your beloved
BOM ;)
See also further below (interspersed)
==== //depot/perl/t/lib/warnings/utf8#7 (text) ====
Index: perl/t/lib/warnings/utf8
--- perl/t/lib/warnings/utf8.~1~ Tue Apr 2 13:45:05 2002
+++ perl/t/lib/warnings/utf8 Tue Apr 2 13:45:05 2002
@@ -39,7 +39,6 @@
my $dfff = chr(0xDFFF);
my $e000 = chr(0xE000);
my $fffd = chr(0xFFFD);
-my $fffe = chr(0xFFFE);
my $ffff = chr(0xFFFF);
my $hex4 = chr(0x10000);
my $hex5 = chr(0x100000);
@@ -50,7 +49,6 @@
my $dfff = chr(0xDFFF);
my $e000 = chr(0xE000);
my $fffd = chr(0xFFFD);
-my $fffe = chr(0xFFFE);
my $ffff = chr(0xFFFF);
my $hex4 = chr(0x10000);
my $hex5 = chr(0x100000);
@@ -58,9 +56,8 @@
EXPECT
UTF-16 surrogate 0xd800 at - line 3.
UTF-16 surrogate 0xdfff at - line 4.
-Unicode character 0xfffe is illegal at - line 7.
-Unicode character 0xffff is illegal at - line 8.
-Unicode character 0x10ffff is illegal at - line 11.
+Unicode character 0xffff is illegal at - line 7.
+Unicode character 0x10ffff is illegal at - line 10.
########
use warnings 'utf8';
my $d7ff = pack("U", 0xD7FF);
Perhaps add a test for 0x10fffe as well?
@@ -68,7 +65,6 @@
my $dfff = pack("U", 0xDFFF);
my $e000 = pack("U", 0xE000);
my $fffd = pack("U", 0xFFFD);
-my $fffe = pack("U", 0xFFFE);
my $ffff = pack("U", 0xFFFF);
my $hex4 = pack("U", 0x10000);
my $hex5 = pack("U", 0x100000);
@@ -79,7 +75,6 @@
my $dfff = pack("U", 0xDFFF);
my $e000 = pack("U", 0xE000);
my $fffd = pack("U", 0xFFFD);
-my $fffe = pack("U", 0xFFFE);
my $ffff = pack("U", 0xFFFF);
my $hex4 = pack("U", 0x10000);
my $hex5 = pack("U", 0x100000);
@@ -87,9 +82,8 @@
EXPECT
UTF-16 surrogate 0xd800 at - line 3.
UTF-16 surrogate 0xdfff at - line 4.
-Unicode character 0xfffe is illegal at - line 7.
-Unicode character 0xffff is illegal at - line 8.
-Unicode character 0x10ffff is illegal at - line 11.
+Unicode character 0xffff is illegal at - line 7.
+Unicode character 0x10ffff is illegal at - line 10.
########
use warnings 'utf8';
my $d7ff = "\x{D7FF}";
@@ -97,7 +91,6 @@
my $dfff = "\x{DFFF}";
my $e000 = "\x{E000}";
my $fffd = "\x{FFFD}";
-my $fffe = "\x{FFFE}";
my $ffff = "\x{FFFF}";
my $hex4 = "\x{10000}";
my $hex5 = "\x{100000}";
@@ -108,7 +101,6 @@
my $dfff = "\x{DFFF}";
my $e000 = "\x{E000}";
my $fffd = "\x{FFFD}";
-my $fffe = "\x{FFFE}";
my $ffff = "\x{FFFF}";
my $hex4 = "\x{10000}";
my $hex5 = "\x{100000}";
@@ -116,6 +108,5 @@
EXPECT
UTF-16 surrogate 0xd800 at - line 3.
UTF-16 surrogate 0xdfff at - line 4.
-Unicode character 0xfffe is illegal at - line 7.
-Unicode character 0xffff is illegal at - line 8.
-Unicode character 0x10ffff is illegal at - line 11.
+Unicode character 0xffff is illegal at - line 7.
+Unicode character 0x10ffff is illegal at - line 10.
==== //depot/perl/utf8.c#184 (text) ====
Index: perl/utf8.c
--- perl/utf8.c.~1~ Tue Apr 2 13:45:05 2002
+++ perl/utf8.c Tue Apr 2 13:45:05 2002
@@ -64,13 +64,13 @@
((uv >= 0xFDD0 && uv <= 0xFDEF &&
!(flags & UNICODE_ALLOW_FDD0))
||
- ((uv & 0xFFFF) == 0xFFFE &&
- !(flags & UNICODE_ALLOW_FFFE))
+ (UNICODE_IS_BYTE_ORDER_MARK(uv) &&
+ !(flags & UNICODE_ALLOW_BOM))
BOM should always be allowed, as I understand it -- even if it's not at
the beginning, it's ZWNBSP (at least in Unicode 3.0) and perfectly fine
anywhere in a string. I'm not sure why it needs to be specifically
allowed.
On the other hand, it might make sense to have one flag that allows both
0xyyFFFE and 0xyyFFFF simultaneously (for yy = (0 .. 0x10)) -- perhaps
modify the UNICODE_ALLOW_FFFF flag or simply extend it to allow both
yyFFFF and yyFFFE.
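A minimal sketch of the single-flag idea, not a proposed patch. All names here (SKETCH_ALLOW_NONCHAR, sketch_is_plane_nonchar, sketch_is_allowed) are hypothetical and do not exist in perl/utf8.h. The sketch covers only the yyFFFE/yyFFFF class and the 0x10FFFF upper bound; surrogates and U+FDD0..U+FDEF are left out.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the definitions in perl/utf8.h. */
#define SKETCH_UNICODE_MAX   0x10FFFFUL
#define SKETCH_ALLOW_NONCHAR 0x0008U   /* one flag for both yyFFFE and yyFFFF */

/* True for the last two code points of any plane:
 * 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, ..., 0x10FFFE, 0x10FFFF.
 * Masking with 0xFFFE ignores the lowest bit, so one comparison
 * catches both members of each pair. */
static int sketch_is_plane_nonchar(uint32_t uv)
{
    return (uv & 0xFFFEUL) == 0xFFFEUL;
}

/* Returns 1 if uv is acceptable under the given flags, 0 otherwise.
 * Only the plane-final code points and the upper bound are checked. */
static int sketch_is_allowed(uint32_t uv, unsigned flags)
{
    if (uv > SKETCH_UNICODE_MAX)
        return 0;
    if (sketch_is_plane_nonchar(uv) && !(flags & SKETCH_ALLOW_NONCHAR))
        return 0;
    return 1;
}
```

With this shape, U+FEFF passes with no flag at all, while U+FFFE, U+FFFF and U+10FFFE are rejected or accepted together depending on the one flag.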
||
((uv & 0xFFFF) == 0xFFFF &&
!(flags & UNICODE_ALLOW_FFFF))) &&
/* UNICODE_ALLOW_SUPER includes
- * FFFEs and FFFFs beyond 0x10FFFF. */
+ * FFFFs beyond 0x10FFFF. */
This should probably stay.
((uv <= PERL_UNICODE_MAX) ||
!(flags & UNICODE_ALLOW_SUPER))
)
[snip]
==== //depot/perl/utf8.h#57 (text) ====
Index: perl/utf8.h
--- perl/utf8.h.~1~ Tue Apr 2 13:45:05 2002
+++ perl/utf8.h Tue Apr 2 13:45:05 2002
@@ -188,24 +188,24 @@
#define UNICODE_SURROGATE_FIRST 0xd800
#define UNICODE_SURROGATE_LAST 0xdfff
#define UNICODE_REPLACEMENT 0xfffd
-#define UNICODE_BYTER_ORDER_MARK 0xfffe
+#define UNICODE_BYTE_ORDER_MARK 0xfeff
#define UNICODE_ILLEGAL 0xffff
/* Though our UTF-8 encoding can go beyond this,
- * let's be conservative. */
+ * let's be conservative and do as Unicode 3.2 says. */
#define PERL_UNICODE_MAX 0x10FFFF
#define UNICODE_ALLOW_SURROGATE 0x0001 /* Allow UTF-16 surrogates (EVIL) */
#define UNICODE_ALLOW_FDD0 0x0002 /* Allow the U+FDD0...U+FDEF */
-#define UNICODE_ALLOW_FFFE 0x0004 /* Allow 0xFFFE, 0x1FFFE, ... */
-#define UNICODE_ALLOW_FFFF 0x0008 /* Allow 0xFFFE, 0x1FFFE, ... */
+#define UNICODE_ALLOW_BOM 0x0004 /* Allow 0xFEFF */
+#define UNICODE_ALLOW_FFFF 0x0008 /* Allow 0xFFFF, 0x1FFFF, ... */
#define UNICODE_ALLOW_SUPER 0x0010 /* Allow past 10xFFFF */
#define UNICODE_ALLOW_ANY 0xFFFF
Shouldn't need _ALLOW_BOM here, but perhaps extend _ALLOW_FFFF? Or just
revert the change and have _ALLOW_FFFE parallel to _ALLOW_FFFF (they're
both illegal characters). I just would have thought that one would
usually allow both or neither, not simply one or the other.
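To illustrate why U+FEFF needs no special permission while U+FFFE does: U+FEFF is an ordinary character, and U+FFFE is what a reader sees when a 16-bit BOM arrives in the opposite byte order. The sketch below assumes 16-bit code units already loaded into integers; the names sketch_swap16 and sketch_bom_order are hypothetical and not part of Perl.

```c
#include <stdint.h>

#define SKETCH_BOM         0xFEFFU  /* U+FEFF, ZERO WIDTH NO-BREAK SPACE */
#define SKETCH_SWAPPED_BOM 0xFFFEU  /* U+FFFE, never a valid character */

/* Exchange the two bytes of a 16-bit code unit. */
static uint16_t sketch_swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}

/* Inspect the first 16-bit unit of a stream:
 *   0  -> BOM seen in native byte order
 *   1  -> BOM seen byte-swapped, so the rest of the stream needs swapping
 *  -1  -> no BOM present */
static int sketch_bom_order(uint16_t first_unit)
{
    if (first_unit == SKETCH_BOM)
        return 0;
    if (first_unit == SKETCH_SWAPPED_BOM)
        return 1;
    return -1;
}
```

Because U+FFFE can never be a legitimate character, seeing it at the start of a stream is unambiguous evidence of swapped byte order, which is why it belongs with U+FFFF among the illegal code points rather than with the BOM.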
#define UNICODE_IS_SURROGATE(c) ((c) >= UNICODE_SURROGATE_FIRST && \
(c) <= UNICODE_SURROGATE_LAST)
#define UNICODE_IS_REPLACEMENT(c) ((c) == UNICODE_REPLACEMENT)
-#define UNICODE_IS_BYTE_ORDER_MARK(c) ((c) == UNICODE_BYTER_ORDER_MARK)
+#define UNICODE_IS_BYTE_ORDER_MARK(c) ((c) == UNICODE_BYTE_ORDER_MARK)
#define UNICODE_IS_ILLEGAL(c) ((c) == UNICODE_ILLEGAL)
#ifdef HAS_QUAD
End of Patch.
Cheers,
Philip