Re: Invalid Unicode characters

John Delacour <JD(_at_)BD8(_dot_)COM> writes:

At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:

Dear PERLists,

I am running Perl 5.8 and trying to filter out some invalid Unicode 
characters from Unicode texts in some South Asian languages. There 
are 28 such characters in my data (all control characters):

0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 
0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 
0xB, 0xC, 0xF, 0xFFFF, 0xe


i.e. 
   0x00          ok
   0x01..0x08    bad
   0x09 (TAB)    ok
   0x0A (LF)     ok
   0x0B..0x0C    bad    
   0x0D  (CR)    ok
   0x0e..0x19    bad
   0x1A          ok   (why?)
   0x1b..0x1f    bad 
   0x7f  DEL     ok   (why?)
   0x80..0x9F    ok   (why?)
   0x100..0xFFFE ok

The "bad" ones in my re-ordered table are valid Unicode characters. 
(0xFFFF isn't)      

I think the earlier advice to convert to Perl's internal character form 
and tr/// them out is the best way to proceed.
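
A minimal sketch of that approach, assuming the data is UTF-16LE (the byte 
order is a guess, the original post only says "utf-16") and using a 
hypothetical helper name, strip_controls. It decodes with the Encode module, 
deletes the listed characters with tr///d, and encodes back. U+FFFF is in 
the tr list, but depending on the Encode version the decoder may already 
have replaced it before tr sees it, so that part is worth checking on real 
data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): remove the "bad" characters
# from a UTF-16LE byte string and return UTF-16LE bytes again.
sub strip_controls {
    my ($bytes) = @_;

    # Bytes -> Perl characters. UTF-16LE is an assumption about the data.
    my $text = decode('UTF-16LE', $bytes);

    # Delete 0x01..0x08, 0x0B, 0x0C, 0x0E..0x19, 0x1B..0x1F and 0xFFFF.
    # TAB, LF, CR and 0x1A are left alone, as in the table above.
    $text =~ tr/\x{01}-\x{08}\x{0B}\x{0C}\x{0E}-\x{19}\x{1B}-\x{1F}\x{FFFF}//d;

    # Characters -> bytes, same encoding as the input.
    return encode('UTF-16LE', $text);
}

# Usage: a short sample with two stray control characters in it.
my $sample = encode('UTF-16LE', "ka\x{01}\x{0915}\x{17}\n");
my $clean  = strip_controls($sample);
printf "%d bytes in, %d bytes out\n", length($sample), length($clean);
```

If the files carry a byte order mark, the plain 'UTF-16' encoding name may 
be a better fit than 'UTF-16LE'.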


The data is coded as utf-16 and I want to keep it this way when the 
invalid characters are removed. Is there an easy way to do this with 
Perl while keeping the textual quality intact?


Losing 0x08 (BS) may lose you some over-strikes.
In general, removing things _may_ make textual quality non-intact
if that quality included fixed-length fields or the like.


Your question is not clear to me.


Your complaint isn't clear to me ;-)

You say these are invalid Unicode 
characters and then list 8-bit characters. Are you saying that 
redundant "\x01" etc have got into the text somehow or that 
"\x{0001}" etc. are there?


"\x01" and "\x{0001}" are the same thing.
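
A quick check of that claim, as a sketch: both escapes produce the same 
one-character Perl string, and the width only appears once the string is 
encoded (UTF-16LE here is just an illustrative choice).

```perl
use strict;
use warnings;
use Encode qw(encode);

my $short = "\x01";       # two-digit hex escape
my $long  = "\x{0001}";   # braced hex escape, same code point

print "same string\n" if $short eq $long;
printf "code point U+%04X, length %d\n", ord($long), length($long);

# Encoded as UTF-16LE, either one becomes the two bytes 01 00.
print unpack('H*', encode('UTF-16LE', $short)), "\n";
```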

Can you give us a sample of the offending 
text?  Are you saying it is like the UTF-16 equivalent of the output 
of this?

perl -e 'print qq~\x17\x{6017}\x18\x{6001}~'

JD