perl-unicode

Re: Fallback problems with Encode

2002-12-28 15:30:05
On December 28, 2002 at 20:51, Nick Ing-Simmons wrote:

BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
used for the ascii test, and entity references are generated for the
8-bit characters.

As I stated in my original post, the problem is that t/fallbacks.t
tests an undocumented (or poorly documented) Encode interface, and
it does not test the well-documented interface.

Whether un(der)?documented or not the object style used in t/fallback.t 
is the way the internals work. 

But t/fallback.t fails to properly test what is clearly documented
as the API in the docs.  t/fallback.t *should* also test the
documented API functions and not just how the internals work.  IMO,
any experienced test engineer will agree with this assessment.

You say "... it is impractical to maintain unique
conversion tables between all types of character encodings." - it is even 
more impractical to _test_ them that way.

Agreed, but in testing, you can have cases that can be used to represent
a class of cases.  In this case, a conversion from one set to another
where the original set contains octets/bytes/characters that are
undefined (e.g. ISO-8859-3).  There is no need to test all possible
combinations.

So why doesn't the from_to() usage generate the same results?

Because the ->decode side has removed the non-representable octets
and replaced them with 4-chars each: \xHH. 
So there are no hi-bit chars to cause entity refs.

This is the explanation I was looking for.  I.e. from_to() is not
"atomic", it is really a two step process (which is obvious for the
technically inclined when thinking about how the internals may work,
but the fallback flags are also impacted by the two step process).
This should be documented since the semantics of the fallback flags
as documented are not preserved across the from_to() process.
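
The two-step effect described above can be reproduced with explicit
calls.  This is a minimal sketch, not the from_to() source: it assumes
the same FB_XMLCREF flag reaches both steps, as the explanation in this
thread implies, and the sample string is made up for illustration.
Other Encode releases may route the flag differently.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "caf" followed by octet 0xE9, which is not valid ASCII.
my $octets = "caf\xE9";

# Same check value for both steps (assumption: this mirrors what the
# thread describes for from_to()).  LEAVE_SRC keeps the inputs intact.
my $check = Encode::FB_XMLCREF() | Encode::LEAVE_SRC();

# Step 1: decode.  For XS-based encodings such as ascii, the decode side
# writes an undecodable octet as the four characters \xHH, even though
# FB_XMLCREF was requested.
my $chars = decode('ascii', $octets, $check);

# Step 2: encode.  The string now holds only ASCII characters, so the
# encode side has nothing left to turn into an entity reference.
my $out = encode('ascii', $chars, $check);

print "after decode: $chars\n";
print "after encode: $out\n";
```

The point of the sketch is that the entity-reference fallback never
fires: by the time the encode step runs, the 8-bit octet is already
gone.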

If it were "atomic", the ->decode side would _not_ remove the
non-representable octets and replace them with 4 chars, but would "pass
them through" to the ->encode side so the fallback flags would have
the predicted effects.  Now I know doing this may complicate the
implementation of from_to().  Therefore, the semantics of the fallback
flags should be documented for from_to() or not supported at all,
maybe by issuing a warning.

You can get that (I believe) by passing appropriate fallback options to 
->decode of ASCII. I personally dislike fallback to '?' as it loses 
information in a way that is hard to back-track - which is why default 
fallback is \xHH.

Reasonable.  Note, this behavior, wrt from_to(), highlights the
confusion for the user of from_to().  When FB_XMLCREF is specified, and
all of a sudden \xHH's show up, it implies that FB_PERLQQ was being used.

Maybe I am misunderstanding Encode's conversion operations, so
maybe it is a problem with the documentation not being clear about
this behavior.  But IMHO, what I am getting appears to be incorrect.

And IMHO you are getting what I "designed" it to produce ;-) 

As I like to say, "works as coded." ;-)

I strongly recommend doing conversions in two steps explicitly - that way 
you can get whatever you want.
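
An explicit two-step conversion might look like the sketch below.  It
assumes the 8-bit input is really ISO-8859-1 (an assumption for
illustration, not something stated in this thread) and uses a made-up
sample string.  Because each step gets its own encoding and its own
check flag, the high-bit character survives decoding and reaches the
encode step, where FB_XMLCREF can turn it into an entity reference.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "caf\xE9";   # hypothetical sample input

# Step 1: decode with an encoding that defines every octet in the input,
# so nothing is replaced at this stage.  FB_CROAK makes any surprise fatal.
my $chars = decode('iso-8859-1', $octets,
                   Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Step 2: encode to ASCII.  U+00E9 has no ASCII mapping, so the
# encode-side fallback replaces it with an XML character reference.
my $out = encode('ascii', $chars,
                 Encode::FB_XMLCREF() | Encode::LEAVE_SRC());

print "$out\n";
```

Choosing the flags per step is what the single check argument of
from_to() cannot express.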

I find the from_to() much more convenient code-wise.  I think the
limitations of from_to() should be documented, or its use deprecated
since it appears to be just a wrapper around ->decode ->encode.
Note, some may think that calling from_to() may be slightly more
efficient than doing the ->decode ->encode directly (i.e. from_to()
could be an XS routine and/or short-cut some steps).  If this is not
the case, why even bother having from_to()?

I am also willing to concede that documentation could be improved :-)

Of course, no one reads the documentation :-)

--ewh
