Martin_Hosken(_at_)sil(_dot_)org writes:
: Is there any way that we could use tr/// to do 8-bit to Unicode conversions
: simply? I am envisaging something like:
:
: tr/[\x80-\x9f]/\x20ac\x..../U;
:
: or the like whereby the lhs of the tr is considered in binary and the rhs in
: UTF-8.
Already been done, though what you want looks more like
tr/\x80-\x9f/\x{20ac}\x..../CU;
: Likewise for reverse conversion you could have UTF8 on the lhs and 8-bit
: clean on the rhs.
Just use UC instead of CU.
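For what it's worth, here is a minimal sketch of the same round trip without the C and U flags. It assumes a later perl (5.8 or newer, as far as I know) where those flags are gone and tr/// works on characters directly. Only the \x80 to U+20AC pair comes from the example above; the helper names to_unicode and to_8bit are made up for illustration, and the rest of the \x80-\x9f range is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper: map the 8-bit code 0x80 to U+20AC.
# Every other character passes through unchanged.
sub to_unicode {
    my ($s) = @_;
    $s =~ tr/\x80/\x{20ac}/;
    return $s;
}

# Hypothetical helper: the reverse mapping, U+20AC back to 0x80.
sub to_8bit {
    my ($s) = @_;
    $s =~ tr/\x{20ac}/\x80/;
    return $s;
}

printf "U+%04X\n", ord(to_unicode("\x80"));    # prints "U+20AC"
printf "0x%02X\n", ord(to_8bit("\x{20ac}"));   # prints "0x80"
```

Swapping the two lists is all it takes to reverse the direction, which is the same idea as swapping CU for UC.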
: The only difficulty here is that you would want an extra code on the rhs
: to be used for the 'out of range' code (what happens when a code >256
: isn't matched and converted, you want a default character inserted rather
: than the thing deleted).
The tr/// operator already has a mechanism for defaults, in that it
replicates the last character of the rhs if it's too short. Also,
the rule is that if a given character is specified more than once, the
first translation is used. So
tr/a\0-\x{10ffff}/bX/UC;
should translate a to b and everything else to X. It should even do
it fairly efficiently, since chunks of table aren't allocated unless
needed.
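The two rules can be seen together in a small sketch. To keep it simple it narrows the range to \0-\xff and drops the flags, both of which are my assumptions rather than part of the line above; the sub name a_to_b_else_x is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a -> b, anything else in \0-\xff -> X.
sub a_to_b_else_x {
    my ($s) = @_;
    # 'a' is listed twice: once explicitly and once inside the range.
    # The first translation (a -> b) is the one that is used.
    # The rhs "bX" is shorter than the lhs, so its last character, X,
    # is replicated to cover the rest of the range.
    $s =~ tr/a\0-\xff/bX/;
    return $s;
}

print a_to_b_else_x("banana"), "\n";   # prints "XbXbXb"
```

The trailing X plays the part of the 'out of range' default character asked about above.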
Larry