Re: Invalid Unicode characters

At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:

Dear PERLists,
I am running Perl 5.8 and trying to filter out some invalid Unicode characters from Unicoded texts of some South Asian languages. There are 28 such characters in my data (all control characters):
0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xF, 0xFFFF, 0xe
The data is coded as utf-16 and I want to keep it this way when the invalid characters are removed. Is there an easy way to do this with Perl while keeping the textual quality intact?

Your question is not clear to me. You say these are invalid Unicode characters and then list 8-bit characters. Are you saying that redundant "\x01" etc. have got into the text somehow, or that "\x{0001}" etc. are there? Can you give us a sample of the offending text? Are you saying it is like the UTF-16 equivalent of the output of this?


perl -e 'print qq~\x17\x{6017}\x18\x{6001}~'

JD
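
For what it is worth, here is a minimal sketch of one possible approach, assuming the second reading (the unwanted items are whole 16-bit code units such as "\x{0017}") and assuming the data is little-endian UTF-16. The helper name strip_unwanted_utf16le and the %unwanted table are made up for illustration; they do not come from the thread. The sketch works on 16-bit units rather than bytes, because in UTF-16 the byte 0x17 also occurs as half of a legitimate character such as U+6017, so stripping single bytes would damage the text.

```perl
use strict;
use warnings;

# The 28 code points listed in the question: U+0001-U+0008, U+000B, U+000C,
# U+000E-U+0019, U+001B-U+001F and U+FFFF.
my %unwanted = map { $_ => 1 }
    (0x01 .. 0x08, 0x0B, 0x0C, 0x0E .. 0x19, 0x1B .. 0x1F, 0xFFFF);

# Hypothetical helper: takes UTF-16LE bytes and returns UTF-16LE bytes
# with the unwanted code units dropped. Assumption: little-endian input;
# for big-endian data use the 'n*' template instead of 'v*'.
sub strip_unwanted_utf16le {
    my ($bytes) = @_;
    my @units = unpack('v*', $bytes);    # 16-bit little-endian code units
    return pack('v*', grep { !$unwanted{$_} } @units);
}

# The sample from the one-liner above, written out as UTF-16LE.
my $sample  = pack('v*', 0x17, 0x6017, 0x18, 0x6001);
my $cleaned = strip_unwanted_utf16le($sample);

# prints: U+6017 U+6001
print join(' ', map { sprintf 'U+%04X', $_ } unpack('v*', $cleaned)), "\n";
```

None of the listed values falls in the surrogate range (0xD800 to 0xDFFF), so dropping whole units cannot split a surrogate pair, and a leading byte order mark (0xFEFF) passes through untouched. If the data turns out to contain stray single bytes instead, this sketch does not apply.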
