nmh-workers

Re: mhfixmsg character set conversion

2022-02-09 18:50:10
Really.  I'm not making this up. :-/

No, I don't think you are.  I think that line in both files is correctly
UTF-8 encoded.

And now that you've explained what's going on, it's clear that you're
right.


vim isn't the vi(1) I grew up with, and probably you too.

Definitely.  The first time I used vi was in 1984, on a 68000-based Cadmus
system.


Try ‘:se fileencoding?’ when vim-ing good and again with bad.

Good point:

   $ vim good
   :set fileencoding
   fileencoding=utf-8

   $ vim bad
   :set fileencoding
   fileencoding=latin1


I expect the bad file has something earlier on which fixes vim's idea of
the encoding to ISO 8859-1.

That does seem to be the case.  Do you have any idea what kind of thing
that might be?  (I know you can't diagnose a file you haven't seen, but in
general, what sorts of things should I look for?)
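
One general thing to look for: vim's usual 'fileencodings' setting tries
UTF-8 first and typically falls back to latin1 as soon as it meets a byte
sequence that is not valid UTF-8, so any line that fails UTF-8 validation
is a candidate.  A minimal sketch, assuming iconv(1) is installed;
find_non_utf8 and the sample file are made up for illustration:

```shell
# Hypothetical helper: print the number of every line in "$1" that is
# not valid UTF-8, by round-tripping each line through iconv.
find_non_utf8() {
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 ||
            echo "$n"
    done < "$1"
}

# Made-up sample: line 2 holds a lone Latin-1 e-acute (octal 351).
sample=$(mktemp)
printf 'caf\303\251\ncaf\351\nplain\n' > "$sample"
bad_lines=$(find_non_utf8 "$sample")
echo "$bad_lines"
rm -f "$sample"
```

Running the helper on the real file (find_non_utf8 bad) should point at
whatever occurs before line 108.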


But wait.  It gets worse:

   $ grep -n ^Veuillez good | cut -c1-68
   108:Veuillez ne pas répondre au présent courriel. Il a été gén�

   $ grep -n ^Veuillez bad | cut -c1-68
   108:Veuillez ne pas répondre au présent courriel. Il a été gén�

The worse being that grep shows the very same line 108 you're seeing
in vim?

Exactly, because...


(The ‘�’ at the end is to be expected.)

...this is still more evidence that you know more about character sets and
conversions than I do.  As if further evidence were needed at this point. :-/

Until now, I've only ever seen that glyph when a character doesn't exist in
the font being used -- but that can't be the case here because that same
character is shown correctly five times in the same line of output.

Why is it to be expected?
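
A likely answer, assuming cut(1) counts bytes rather than characters for
-c (as GNU cut does): each é is two bytes in UTF-8 (c3 a9), and with the
"108:" prefix from grep -n, the 68th byte of the line is the c3 that starts
the next é.  cut keeps that byte and drops the a9 after it, and the
terminal shows the orphaned c3 as ‘�’.  The sketch below uses -b to make
the byte counting explicit on a short made-up string:

```shell
# "gén" followed by another é, written as octal escapes so the bytes
# are explicit.  Keeping 5 bytes cuts the second é (c3 a9) in half.
cut_bytes=$(printf 'g\303\251n\303\251\n' | cut -b1-5 | od -An -tx1 | tr -d ' \n')
echo "$cut_bytes"    # 67c3a96ec30a
```

The output ends in c3 followed by the newline (0a), with no a9: an
incomplete UTF-8 sequence.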


   $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
[...]

I don't understand that.  The -p sets up a loop to read a line from
good_snippet, do the substitution on it, and print the result, until
EOF.  The -l strips off the linefeed on input and puts it back on the
output.  The substitution in between works on bytes rather than
characters, thanks to LC_ALL=C, and changes every byte that isn't in the
range space to tilde into a ‘<42>’-style string representing its hex value.

Thank you for explaining that.

Just for fun, I tried the following in tcsh:

   $ setenv LC_ALL C
   $ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
   Veuillez ne pas r<c3><a9>pondre au pr<c3><a9>sent courriel. Il a <c3><a9>t<c3><a9> g<c3><a9>n<c3><a9>r<c3><a9>

As expected, this returned pretty much instantly.  Then I tried this:

   $ sh
   $ LC_ALL=C
   $ echo $LC_ALL
   C
   $ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

...and that also hung.  Which in a way is good, because at least it means
bash is behaving consistently.  But also not good, because it's behaving
badly. :-/

On my system, /bin/sh is a symlink to /bin/bash, which is version 5.1.016-2
as packaged by Manjaro.
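
One side note on the sh session above, not offered as an explanation of
the hang: in a Bourne-style shell a plain VAR=value only reaches child
processes if VAR is already exported, whereas tcsh's setenv always puts
the variable in the environment.  A small sketch with a hypothetical
variable name (DEMO_VAR):

```shell
# DEMO_VAR is a made-up name used only for this demonstration.
unset DEMO_VAR
DEMO_VAR=C
before=$(sh -c 'echo "${DEMO_VAR-unset}"')   # child does not see it
export DEMO_VAR
after=$(sh -c 'echo "${DEMO_VAR-unset}"')    # child sees it now
echo "before export: $before, after export: $after"
```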

...but troubleshooting bash is far outside the scope of this discussion, so
I propose to forget this particular clupea harengus of the crimson variety
unless you find it interesting in and of itself.


Nothing wrong with od(1).  If you have hexdump(1) installed, then
‘hexdump -C’ gives quite nice output.

Yes, I see (or -C? :-).  Thanks for that tip; I hadn't known that hexdump
existed.
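
For comparison, a small sketch that uses hexdump -C when it is installed
and falls back to od otherwise; the dump function name and the sample
bytes are made up for illustration:

```shell
# Hypothetical wrapper: prefer hexdump -C, fall back to POSIX od.
dump() {
    if command -v hexdump >/dev/null 2>&1; then
        hexdump -C "$@"
    else
        od -A x -t x1 "$@"
    fi
}

# "gén" plus a newline; é shows up as the byte pair c3 a9 either way.
printf 'g\303\251n\n' | dump
```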


...and both snippets are identical!

Well, those lines were identical to start with before snipping.
You could confirm this with

   cmp <(sed -n 108p good) <(sed -n 108p bad)

As written, this also hangs in bash (and is invalid syntax in tcsh).

But it's effectively equivalent to

   $ sed -n 108p good > good.sed
   $ sed -n 108p bad  >  bad.sed 
   $ cmp good.sed bad.sed
   $ echo $?
   0

...which behaves as expected.
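
The same comparison can be wrapped up without process substitution, so
that it works in any POSIX sh.  This is a sketch only; same_line is a
made-up name, and it relies on cmp accepting "-" for standard input:

```shell
# Hypothetical helper: same_line N FILE1 FILE2
# Exit status 0 if line N is byte-for-byte identical in both files.
same_line() {
    tmp=$(mktemp) || return 2
    sed -n "${1}p" "$2" > "$tmp"
    sed -n "${1}p" "$3" | cmp -s - "$tmp"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

With the files from this thread, ‘same_line 108 good bad; echo $?’ would
be expected to print 0.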


Strangely, both snippet files look fine in vim.

Because you have chopped off the non-UTF-8 bytes which occur earlier in
bad and which fix vim's idea of the file's encoding.

In retrospect this should have been obvious. :-/


...but for the bad file, that becomes

   "bad" [converted] 336 lines, 49471 bytes         1,1           Top

Ta-da!

Indeed. :-)

Thank you.

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff      |
Montreal, QC, Canada |             Eschew obfuscation.
smw@smwonline.ca     |
http://smwonline.ca  |
