John Delacour <JD(_at_)BD8(_dot_)COM> writes:
At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:
Dear PERLists,
I am running Perl 5.8 and trying to filter out some invalid Unicode
characters from Unicoded texts of some South Asian languages. There
are 28 such characters in my data (all control characters):
0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19,
0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
0xB, 0xC, 0xF, 0xFFFF, 0xe
i.e.
0x00 ok
0x01..0x08 bad
0x09 (TAB) ok
0x0A (LF) ok
0x0B..0x0C bad
0x0D (CR) ok
0x0e..0x19 bad
0x1A ok (why!)
0x1b..0x1f bad
0x7f DEL ok (why?)
0x80..0x9F ok (why?)
0x100..0xFFFE ok
The "bad" ones in my re-ordered table are valid Unicode characters.
(0xFFFF isn't)
I think the earlier advice to convert to Perl's internal form and
tr/// them out is the best way to proceed.
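
A minimal sketch of that approach, assuming the data is UTF-16 with a BOM and fits in memory. The helper name strip_controls_utf16 is hypothetical, and the tr/// list is just the 28 code points from the original post. I have not checked how every Encode version treats U+FFFF while decoding, so take that part as an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-16 bytes (BOM required), returns
# UTF-16 bytes (big-endian, with BOM) minus the unwanted code points.
sub strip_controls_utf16 {
    my ($bytes) = @_;

    # Convert to Perl's internal character form; the BOM selects endianness.
    my $text = decode('UTF-16', $bytes);

    # Delete the 28 code points listed in the post. TAB, LF, CR and 0x1A
    # are deliberately left alone. Assumption: U+FFFF survives decode as-is.
    $text =~ tr/\x{01}-\x{08}\x{0B}\x{0C}\x{0E}-\x{19}\x{1B}-\x{1F}\x{FFFF}//d;

    # Back to UTF-16 so the data stays in its original encoding.
    return encode('UTF-16', $text);
}

# Usage: a short string with two control characters mixed in.
my $dirty = encode('UTF-16', "a\x{01}\x{0915}\x{1B}\tb\n");
my $clean = strip_controls_utf16($dirty);
```

Note that encode('UTF-16', ...) writes big-endian with a BOM; if the input was little-endian and that matters, 'UTF-16LE' plus an explicit BOM would be needed instead.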
The data is coded as utf-16 and I want to keep it this way when the
invalid characters are removed. Is there an easy way to do this with
Perl while keeping the textual quality intact?
Losing 0x08 (BS) may lose you some over-strike.
In general removing things _may_ make textual quality non-intact
if that quality included fixed-length fields or the like.
Your question is not clear to me.
Your complaint isn't clear to me ;-)
You say these are invalid Unicode
characters and then list 8-bit characters. Are you saying that
redundant "\x01" etc have got into the text somehow or that
"\x{0001}" etc. are there?
"\x01" and "\x{0001}" are the same thing.
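
A quick check of that claim, as a sketch with illustrative variable names: both escapes produce a one-character string holding code point 1.

```perl
use strict;
use warnings;

my $short = "\x01";       # two-digit hex escape
my $wide  = "\x{0001}";   # braced hex escape, same code point

# Both are a single character with ordinal 1, so they compare equal.
print $short eq $wide ? "same\n" : "different\n";
printf "length=%d ord=%d\n", length($wide), ord($wide);
```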
Can you give us a sample of the offending
text? Are you saying it is like the UTF-16 equivalent of the output
of this?
perl -e 'print qq~\x17\x{6017}\x18\x{6001}~'
JD
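
For reference, a sketch of what "the UTF-16 equivalent" of that string would look like, assuming big-endian with no BOM. The variable names are illustrative only. Each character becomes one 16-bit unit, so the control characters show up as 0017 and 0018 between the CJK units.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same four characters as in the one-liner above.
my $sample = "\x17\x{6017}\x18\x{6001}";

# Encode as UTF-16BE (no BOM) and dump the 16-bit units in hex.
my $utf16be = encode('UTF-16BE', $sample);
my $hex = join ' ', map { sprintf '%04X', $_ } unpack 'n*', $utf16be;

print "$hex\n";   # prints: 0017 6017 0018 6001
```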