At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:
Dear PERLists,
I am running Perl 5.8 and trying to filter out some invalid Unicode
characters from Unicode texts in some South Asian languages. There
are 28 such characters in my data (all control characters):
0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19,
0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
0xB, 0xC, 0xF, 0xFFFF, 0xe
The data is coded as utf-16 and I want to keep it this way when the
invalid characters are removed. Is there an easy way to do this with
Perl while keeping the textual quality intact?
Your question is not clear to me. You say these are invalid Unicode
characters and then list 8-bit characters. Are you saying that
redundant "\x01" etc. bytes have got into the text somehow, or that
"\x{0001}" etc. are there as full 16-bit characters? Can you give us
a sample of the offending text? Are you saying it is like the UTF-16
equivalent of the output of this?
perl -e 'print qq~\x17\x{6017}\x18\x{6001}~'
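
If the stray values turn out to be whole 16-bit code units (the
"\x{0001}" case), one possible approach is to filter the text unit by
unit rather than byte by byte. That way a character such as U+6017,
which contains the byte 0x17, is left alone. The following is only a
minimal sketch under assumptions not confirmed by the original post:
the data is taken to be little-endian UTF-16 unless told otherwise, and
strip_listed_units is a hypothetical helper name. Since none of the 28
values fall in the surrogate range, dropping matching units should not
split a surrogate pair.

```perl
use strict;
use warnings;

# The 28 code points listed in the original post, as a lookup table.
my %unwanted = map { $_ => 1 }
    (0x01 .. 0x08, 0x0B, 0x0C, 0x0E .. 0x19, 0x1B .. 0x1F, 0xFFFF);

# Hypothetical helper: takes raw UTF-16 bytes and returns them with the
# listed 16-bit code units dropped. $template is 'v' for little-endian
# (the assumed default) or 'n' for big-endian data.
sub strip_listed_units {
    my ($bytes, $template) = @_;
    $template ||= 'v';
    my @units = unpack("$template*", $bytes);
    return pack("$template*", grep { !$unwanted{$_} } @units);
}

# Example: "A", U+0017, U+6017, U+0018 encoded as UTF-16LE.
my $in  = pack('v*', 0x41, 0x17, 0x6017, 0x18);
my $out = strip_listed_units($in);
print join(' ', map { sprintf '%04X', $_ } unpack('v*', $out)), "\n";
# prints "0041 6017"
```

When reading from and writing to files, the handles would need
binmode set so that the bytes pass through unchanged. A byte order
mark (0xFEFF), if present, is not in the list and so is kept.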
JD