Re: Pattern matching with Unicode (5.6.1)

On Thu, Aug 15, 2002 at 05:28:43PM -0400, David Gray wrote:

I'm having a bit of a problem getting Unicode pattern 
matching to do what I would like it to.


I guess my question wasn't entirely clear. I'm reading in the attached
file and trying to split it on "\n\n".

When I'm looping over the file,

I've (sort of) made it work by doing:

 # strip BOM and trailing nulls and carriage returns
 s/^..// if $. == 1 and s/\0//g;
 s/[\0\r]//g;


The two-byte BOM has me thinking it's probably UTF-16. Is there an easy
way to tell what encoding a file uses?


Not that I know of, but all the 0 bytes make me think it is.

But I'm sure there must be a more elegant way to do this. 
Honestly, I'm not even sure where to start. Any ideas?


I find that this:

perl5.6.1 -we 'undef $/; $_=<STDIN>; $_ = pack "U*", unpack "v*", $_;
substr($_, 0, 1) = ""; print $_' </tmp/unicode.txt

gives me this:

fdn "grp1",55,"","",0

fdn "grp2",55,"","",0

fdn "grp3",55,"","",0

fdn "grp4",55,"","",0

fdn "grp5",55,"","",0

fdn "TEMP",55,"","",0


The substr takes out the byte order mark.

I guess a better conversion script would read the first two bytes and,
if they look like a UTF-16 byte order mark, choose whether to use v or n
in the unpack based on the endianness.
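
A rough sketch of that idea, for illustration only. The helper name
utf16_to_chars is made up, it assumes the whole file has already been
read as raw bytes, and it makes no attempt to combine surrogate pairs,
so it is only good for characters below U+10000:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a UTF-16 byte string (with BOM) into a Perl
# character string. Returns undef if there is no recognisable BOM.
sub utf16_to_chars {
    my ($bytes) = @_;
    my $bom = substr($bytes, 0, 2);

    my $template;
    if    ($bom eq "\xFF\xFE") { $template = "v*" }   # little-endian
    elsif ($bom eq "\xFE\xFF") { $template = "n*" }   # big-endian
    else                       { return undef }       # no BOM: don't guess

    # Skip the two BOM bytes, then unpack the 16-bit code units and
    # repack them as characters. Surrogate pairs are NOT handled.
    my @units = unpack $template, substr($bytes, 2);
    return pack "U*", @units;
}
```

To use it you would slurp the file with binmode on the handle and pass
the result in; an undef return means the first two bytes were not a
UTF-16 BOM, in which case guessing the byte order seems unwise.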

You will get more sane regexp behaviour if you use 5.8.0 rather than 5.6.1.
In 5.6.1 being in the scope of a "use utf8;" will make your regexps properly
Unicode, even if they don't contain obvious Unicode features.

(Otherwise matches involving . and similar metachars cause the regexp to think
in ASCII, and Unicode scalars are treated as a series of bytes.
5.8.0 fixes this problem - regexps "just work" there, modulo unknown bugs.)
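
For what it's worth, here is a sketch of how the original split might
look on 5.8.0 or later, where the Encode module is bundled with perl.
The helper name utf16_records is made up, and I'm assuming the file
starts with a BOM and uses CRLF or LF line endings:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a UTF-16 byte string and split it into
# records separated by blank lines.
sub utf16_records {
    my ($bytes) = @_;

    # The plain "UTF-16" encoding reads the BOM to pick the byte order
    # and removes it from the result; it dies if no BOM is present.
    my $text = decode('UTF-16', $bytes);

    $text =~ s/\r//g;               # drop carriage returns
    return split /\n\n+/, $text;    # records are separated by blank lines
}
```

Once the data is decoded like this the regexps are working on
characters rather than bytes, so there are no stray nulls to strip.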

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/