perl-unicode

Re: [PATCH]s and questions [Encode] 1.30

2002-04-08 12:34:30
On Mon, 8 Apr 2002 15:24:57 +0400, tagunov(_at_)motor(_dot_)ru (Anton Tagunov)
wrote:

2) [PATCH], thanks to Philip Newton

--- E:\anth\tmp\perl\b2\ext\Encode-1.30\lib\Encode\Supported.pod.orig   Mon Apr  8 14:06:12 2002
+++ E:\anth\tmp\perl\b2\ext\Encode-1.30\lib\Encode\Supported.pod        Mon Apr  8 15:18:34 2002
@@ -592,7 +592,7 @@
 JIS has not endorsed the full Microsoft standard however.
 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 subsets, while Microsoft has always been meaning C<Shift_JIS> to
-encode a wider character repertoire, see C<IANA> registration for
+encode a wider character repertoire. See C<IANA> registration for
 C<Windows-31J>.
 
 As a historical predecessor Microsoft's variant
@@ -600,7 +600,7 @@
 that Microsoft shouldn't have used JIS as part of the name
 in the first place.
 
-Unabiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
+Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
 
 Encode separately supports C<Shift_JIS> and C<cp932>.

This bit appears not to have been applied?

Here it is again, together with a few more tweaks to Encode::Unicode.

--- ext/Encode/lib/Encode/Supported.pod.orig    Mon Apr  8 14:41:01 2002
+++ ext/Encode/lib/Encode/Supported.pod Mon Apr  8 20:21:23 2002
@@ -592,7 +592,7 @@
 JIS has not endorsed the full Microsoft standard however.
 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 subsets, while Microsoft has always been meaning C<Shift_JIS> to
-encode a wider character repertoire, see C<IANA> registration for
+encode a wider character repertoire. See C<IANA> registration for
 C<Windows-31J>.

 As a historical predecessor Microsoft's variant
@@ -600,7 +600,7 @@
 that Microsoft shouldn't have used JIS as part of the name
 in the first place.

-Unabiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
+Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.

 Encode separately supports C<Shift_JIS> and C<cp932>.

--- ext/Encode/lib/Encode/Unicode.pm.orig       Mon Apr  8 14:41:01 2002
+++ ext/Encode/lib/Encode/Unicode.pm    Mon Apr  8 20:30:50 2002
@@ -74,9 +74,9 @@
 sub new_sequence { $_[0] };

 #
-# the two implementation of (en|de)code exist.  *_modern use
-# array and *_classic stick with substr.  *_classic is much
-# slower but more memory conservative.  *_moder is default.
+# two implementations of (en|de)code exist.  *_modern uses
+# an array and *_classic sticks with substr.  *_classic is much
+# slower but more memory conservative.  *_modern is default.

 sub set_transcoder{
     no warnings qw(redefine);
@@ -311,26 +311,26 @@
 =head2 by Size

 UCS-2 is a fixed-length encoding with each character taking 16 bits.
-It B<does not> support I<Surrogate Pair>.  When surrogate pair is
-encountered during decode(), it fills its place with \xFFFD without
-I<CHECK> or croaks if I<CHECK>.  When a character which ord value is
-larger than 0xFFFF, it uses 0xFFFD without I<CHECK> or croaks if
-<CHECK>.
+It B<does not> support I<Surrogate Pairs>.  When a surrogate pair is
+encountered during decode(), its place is filled with \xFFFD without
+I<CHECK> or croaks if I<CHECK>.  When a character whose ord value is
+larger than 0xFFFF is encountered, it uses 0xFFFD without I<CHECK> or
+croaks if I<CHECK>.

-UTF-16 is almost the same as UCS-2 but it supports I<Surrogate Pair>.
+UTF-16 is almost the same as UCS-2 but it supports I<Surrogate Pairs>.
 When it encounters a high surrogate (0xD800-0xDBFF), it fetches the
-following low surrogate (0xDC00-0xDFFF), C<desurrogate> them to form a
+following low surrogate (0xDC00-0xDFFF), C<desurrogate>s them to form a
 character.  Bogus surrogates result in death.  When \x{10000} or above
-is encountered during encode(), it C<ensurrogate>s them and push the
+is encountered during encode(), it C<ensurrogate>s them and pushes the
 surrogate pair to the output stream.

 UTF-32 is a fixed-length encoding with each character taking 32 bits.
-Since it is 32-bit there is no need for I<Surrogate Pair>.
+Since it is 32-bit there is no need for I<Surrogate Pairs>.

 =head2 by Endianness

 First (and now failed) goal of Unicode was to map all character
-repartories into a fixed-length integer so programmers are happy.
+repertoires into a fixed-length integer so programmers are happy.
 Since each character is either I<short> or I<long> in C, you have to
 put endianness of each platform when you pass data to one another.

@@ -345,7 +345,7 @@
             16         32 bits/char
 -------------------------
 BE     0xFeFF 0x0000FeFF
-LE      0xFFeF 0xFeFF0000
+LE      0xFFFe 0xFFFe0000
 -------------------------

 =back
@@ -363,7 +363,7 @@

 When BE or LE is omitted during decode(), it checks if BOM is in the
 beginning of the string and if found endianness is set to what BOM
-says.  if not found, dies.
+says.  If not found, dies.

 =item *

@@ -378,21 +378,22 @@
 UCS-2 is already registered by IANA and others that way.


-=head1 The Surrogate Pair
+=head1 Surrogate Pairs

-To say the least, surrogate pair was the biggest mistake by Unicode
-Consortium.  I don't give a darn if they admit it or not.  But
-according to late Douglas Adams in I<The Hitchhiker's Guide to the
-Galaxy> Triology,  C<First the Universe was created and it was a bad
-move>. Their mistake was not this magnitude so let's forgive them.
+To say the least, surrogate pairs were the biggest mistake of the
+Unicode Consortium.  I don't give a darn if they admit it or not.  But
+according to the late Douglas Adams in I<The Hitchhiker's Guide to the
+Galaxy> Trilogy,  I<In the beginning the Universe was created. This
+has made a lot of people very angry and been widely regarded as a bad
+move>. Their mistake was not of this magnitude so let's forgive them.

 (I don't dare make any comparison with Unicode Consortium and the
 Vogons here ;)  Or, comparing Encode to Babel Fish is completely
 appropriate -- if you can only stick this into your ear :)

-A surrogate pair was born when Unicode Consortium had finally
-admitted that 16 bit was not big enough to hold all the world's
-character repartorie. But they have already made UCS-2 16-bit.  What
+Surrogate pairs were born when Unicode Consortium finally
+admitted that 16 bits were not big enough to hold all the world's
+character repertoire. But they have already made UCS-2 16-bit.  What
 do we do?

 Back then 0xD800-0xDFFF was not allocated.  Let's split them half and
@@ -401,7 +402,7 @@
 * 1024 = 1048576 more characters.  Now we can store character ranges
 up to \x{10ffff} even with 16-bit encodings.  This pair of
 half-character is now called a I<Surrogate Pair> and UTF-16 is the
-name of encoding that embraces them.
+name of the encoding that embraces them.

 Here is a fomula to ensurrogate a Unicode character \x{10000} and
 above;
@@ -413,8 +414,8 @@

  $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

-Note this move has made \x{D800}-\x{DFFF} forbidden zone  but perl
-does not prohibit them for uses.
+Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but
+perl does not prohibit the use of this range of characters.

 =head1 SEE ALSO

End of patch.
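
For anyone who wants to check the arithmetic in the surrogate section, here is a minimal Perl sketch of the round trip. desurrogate() uses the formula visible in the patch context above; ensurrogate() is just its inverse as I would derive it, since that hunk does not show the original formula. Both sub names are hypothetical, borrowed from the POD wording, and are not meant to mirror the internals of Encode::Unicode.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point in U+10000..U+10FFFF into a
# (high, low) surrogate pair.  Derived as the inverse of desurrogate().
sub ensurrogate {
    my ($uni) = @_;
    die sprintf("U+%04X is outside the surrogate-encodable range\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # top 10 bits
    my $lo = ($offset % 0x400) + 0xDC00;      # bottom 10 bits
    return ($hi, $lo);
}

# Hypothetical helper: combine a surrogate pair back into a code point,
# using the formula quoted in the POD.
sub desurrogate {
    my ($hi, $lo) = @_;
    die "bogus surrogate pair\n"
        unless $hi >= 0xD800 && $hi <= 0xDBFF
            && $lo >= 0xDC00 && $lo <= 0xDFFF;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = ensurrogate(0x1F600);
printf "U+1F600 -> 0x%04X 0x%04X\n", $hi, $lo;        # 0xD83D 0xDE00
printf "back    -> U+%X\n", desurrogate($hi, $lo);    # U+1F600
```

The range checks mirror the POD's "Bogus surrogates result in death"; a real codec would presumably honour CHECK instead of always dying.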
