perl-unicode

Detecting malformed characters in files opened with '<:encoding(something)'

2010-10-03 18:29:22
Dear List,

Various places in the Perl docs say, with good and sufficient reason, that when 
reading a UTF-8 file, it should be opened '<:encoding(utf8)' rather than 
'<:utf8'.

The thing is, nowhere can I find documented what happens when a malformed 
character is encountered, or how to affect this. The perluni* documentation 
(intro, tut, code, and faq) covers only Encode::decode(), where the CHECK 
argument is exposed. The 'perldoc -f open', 'perldoc -f readline', and 
'perldoc perlop' documentation is, to my reading at least, equally silent on 
the handling of malformed characters. The last two say that operating system 
errors from reads show up in $!, but a decode error isn't really an operating 
system error, and $! seems _not_ to be set on decode errors.
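
For comparison, here is a minimal sketch of the documented decode() case, 
where CHECK controls what happens to malformed input. The sample bytes are an 
assumption made up for illustration (a lone Latin-1 e-acute byte inside 
otherwise ASCII text); nothing here comes from the attached script.

```perl
use strict;
use warnings;
use Encode ();

# Sample input (an assumption): \xE9 is a UTF-8 lead byte, but it is
# followed by a space instead of continuation bytes, so it is malformed.
my $octets = "caf\xE9 au lait";

# decode() may modify its source argument when CHECK is set, so work on copies.
my $copy = $octets;

# Default CHECK (FB_DEFAULT): the malformed byte becomes U+FFFD.
my $lenient = Encode::decode('UTF-8', $copy);

# FB_CROAK: decode() dies instead; catch the message with eval.
$copy = $octets;
my $strict = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
my $error  = $@;

print "lenient result contains U+FFFD\n" if $lenient =~ /\x{FFFD}/;
print "strict decode died: $error" unless defined $strict;
```

The point of the comparison is that decode() lets the caller pick the policy 
per call, which is exactly what the '<:encoding(...)' layer does not visibly 
offer.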

My reading in this mailing list's archives uncovered PerlIO::encoding. But the 
default $PerlIO::encoding::fallback _ought_ to give a warning when a malformed 
character is encountered, and I can't make it do so.
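
One way to check which behavior the default fallback requests is to test its 
bits against the Encode constants. This is only a sketch: it assumes the 
fallback is the usual integer bitmask, and it reports what the loaded version 
of PerlIO::encoding sets rather than what the layer actually does on bad 
input.

```perl
use strict;
use warnings;
use Encode ();
use PerlIO::encoding;

# The package variable that the :encoding() layer consults.
my $fallback = $PerlIO::encoding::fallback;

# Encode's CHECK bits that are relevant to error handling.
my %flag = (
    WARN_ON_ERR     => Encode::WARN_ON_ERR(),
    DIE_ON_ERR      => Encode::DIE_ON_ERR(),
    PERLQQ          => Encode::PERLQQ(),
    STOP_AT_PARTIAL => Encode::STOP_AT_PARTIAL(),
);

for my $name (sort keys %flag) {
    printf "%-16s %s\n", $name, ($fallback & $flag{$name}) ? 'set' : 'clear';
}
```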

I have experimented in several versions of Perl with the requisite Unicode 
support (5.8.8, 5.8.9, 5.10.1, 5.12.0, 5.12.1, and 5.12.2) using the attached 
script. All treat the malformed character as end-of-file, and none returns any 
sort of error that I can find, except for 5.10.1, which sets $! to 'Bad file 
descriptor' somewhere along the way.
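
A test of this kind can be sketched as follows. This is not the attached 
script, only a hypothetical reconstruction: the file contents, the variable 
names, and the choice of a lone \xE9 byte as the malformed data are all 
assumptions. The numbers it prints depend on the Perl version, so no expected 
output is given.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three lines as raw bytes. The second holds a lone \xE9 byte
# (Latin-1 e-acute), which is not well-formed UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "first\n", "caf\xE9\n", "third\n";
close $out or die "close: $!";

# Collect any warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

open my $in, '<:encoding(utf8)', $path or die "open: $!";
$! = 0;                          # clear errno so a later value is meaningful
my @lines  = <$in>;              # read until readline reports end-of-file
my $errno  = $! + 0;             # numeric errno after the reads
my $at_eof = eof($in) ? 1 : 0;   # does Perl think the handle is exhausted?
close $in;

printf "lines read: %d, at eof: %d, errno: %d, warnings: %d\n",
    scalar @lines, $at_eof, $errno, scalar @warnings;
```

Fewer than three lines read, with no warning and no errno, would correspond to 
the silent end-of-file behavior described above.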

So my questions are: when reading a file opened with '<:encoding(something)',

* Is the behavior on encountering a malformed character documented anywhere?

* If so, where?

* Is there a way to alter this behavior (say, by replacing the malformed data 
with a replacement character a la decode())?

* Is there any way for the Perl script that is doing the reading to find out 
why it failed to get any more data?
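
Whatever the answers for the encoding layer turn out to be, one workaround 
that relies only on documented behavior is to read the file as raw bytes and 
call decode() explicitly. The sketch below assumes the file is small enough to 
slurp; slurp_decoded() is a hypothetical helper name, and the sample file 
contents are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as bytes, then decode it with an
# explicit CHECK value so the caller chooses the policy for malformed data.
sub slurp_decoded {
    my ($path, $encoding, $check) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $octets = do { local $/; <$fh> };
    close $fh;
    $octets = '' unless defined $octets;
    return Encode::decode($encoding, $octets, $check);
}

# Sample file (an assumption): the second line holds a malformed byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "first\ncaf\xE9\nthird\n";
close $out or die "close: $!";

# Strict: FB_CROAK makes decode() die, and $@ says why.
my $text = eval { slurp_decoded($path, 'UTF-8', Encode::FB_CROAK()) };
my $why  = $@;
print "strict decode failed: $why" unless defined $text;

# Lenient: FB_DEFAULT substitutes U+FFFD for the malformed data.
my $repaired = slurp_decoded($path, 'UTF-8', Encode::FB_DEFAULT());
my $count    = () = $repaired =~ /\x{FFFD}/g;
print "replacement characters: $count\n";
```

The trade-off is that line-at-a-time reading is lost, but both of the last two 
questions are answered: the data can be repaired, and the script can learn why 
decoding failed.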

Thank you very much for your time and attention,

Tom Wyant (mailing address to the contrary notwithstanding)

Attachment: encoding
Description: Perl program
