Re: mhfixmsg character set conversion

BTW, to begin a thread, please don't reply to an existing message
on the list and change the subject


That makes sense, but (a) I wasn't trying to start a new thread, and
(b) I replied to an existing message without changing the subject.

I'll try to remember that for future reference, but I don't understand
why you mentioned it here and now.

...but when I look at the files with command-line tools such as more or
head, *both* versions look correct.


Have you patched more or head?  ;-)


No, but that's a fair question. :-)

They're both unpatched, installed as part of util-linux 2.37.3-2 (for more)
and coreutils 9.0-2 (for head) on Manjaro Linux.

Can you cut-and-paste commands and output from your terminal to show us
the problem?


Of course.

Otherwise we have to trust your competency, no offence intended,


None taken.  It's a perfectly fair request.

Here's my go.

How I could be influencing programs.

   $ locale
   LANG=en_GB.utf8
   LC_CTYPE="en_GB.utf8"
   LC_NUMERIC="en_GB.utf8"
   LC_TIME="en_GB.utf8"
   LC_COLLATE="en_GB.utf8"
   LC_MONETARY="en_GB.utf8"
   LC_MESSAGES="en_GB.utf8"
   LC_PAPER="en_GB.utf8"
   LC_NAME="en_GB.utf8"
   LC_ADDRESS="en_GB.utf8"
   LC_TELEPHONE="en_GB.utf8"
   LC_MEASUREMENT="en_GB.utf8"
   LC_IDENTIFICATION="en_GB.utf8"
   LC_ALL=
   $


Mine's:

   $ locale
   LANG=en_CA.UTF-8
   LC_CTYPE="en_CA.UTF-8"
   LC_NUMERIC="en_CA.UTF-8"
   LC_TIME="en_CA.UTF-8"
   LC_COLLATE=C
   LC_MONETARY="en_CA.UTF-8"
   LC_MESSAGES="en_CA.UTF-8"
   LC_PAPER="en_CA.UTF-8"
   LC_NAME="en_CA.UTF-8"
   LC_ADDRESS="en_CA.UTF-8"
   LC_TELEPHONE="en_CA.UTF-8"
   LC_MEASUREMENT="en_CA.UTF-8"
   LC_IDENTIFICATION="en_CA.UTF-8"
   LC_ALL=

Test inputs.

   $ cat good
   Veuillez ne pas répondre au présent courriel. Il a été généré
   automatiquement, nous ne pourrons pas y donner suite.
   $ cat bad
   Veuillez ne pas rÃ©pondre au prÃ©sent courriel. Il a Ã©tÃ© gÃ©nÃ©rÃ©
   automatiquement, nous ne pourrons pas y donner suite.
   $


In my case I don't have just the one sentence in a file by itself, but
let's try grep (unpatched, and installed from grep 3.7-1 on Manjaro):

   $ grep ^Veuillez good | cut -c1-68
   Veuillez ne pas répondre au présent courriel. Il a été généré

   $ grep ^Veuillez bad | cut -c1-68
   Veuillez ne pas répondre au présent courriel. Il a été généré

Really.  I'm not making this up. :-/

...but if I open the incorrect output file in vim and go to line 108,
I see this (pasted from an xterm in which vim was running):

   Veuillez ne pas rÃ©pondre au prÃ©sent courriel. Il a Ã©tÃ© gÃ©nÃ©rÃ©

But wait.  It gets worse:

   $ grep -n ^Veuillez good | cut -c1-68
   108:Veuillez ne pas répondre au présent courriel. Il a été gén�

   $ grep -n ^Veuillez bad | cut -c1-68
   108:Veuillez ne pas répondre au présent courriel. Il a été gén�
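
A possible explanation for the trailing "�", assuming cut -c counts bytes
rather than characters here (the GNU cut documentation describes -c as
currently the same as -b): the "108:" prefix from grep -n adds four bytes,
so the 68-byte limit now falls between the two bytes of the final é.  A
minimal sketch with a single made-up word, using cut -b to make the byte
counting explicit:

```shell
# "généré" is 6 characters but 9 bytes in UTF-8 (each é is c3 a9).
# The octal escapes keep this independent of the terminal encoding.
word=$(printf 'g\303\251n\303\251r\303\251')

# Count the bytes in the word.
nbytes=$(printf '%s' "$word" | wc -c | tr -d ' ')
echo "bytes: $nbytes"

# Keep all 9 bytes: the word survives intact.
printf '%s\n' "$word" | cut -b1-9 | od -An -tx1

# Prefix 4 bytes, as grep -n does with "108:", and keep the same 9 bytes:
# the output now ends on c3, the first half of an é.
printf '108:%s\n' "$word" | cut -b1-9 | od -An -tx1
```

A terminal that receives a lone c3 byte has no complete character to draw,
which would account for the replacement character at the end of the line.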

Is my shell somehow getting involved?

   $ echo $SHELL
   /usr/bin/tcsh

That's (also unpatched :-) tcsh 6.23.02-1 from Manjaro's tcsh package.

bad is double-encoded.

   $ iconv -f iso-8859-1 -t utf-8 good | cmp - bad
   $
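
For reference, this is what that conversion does at the byte level, as a
sketch on a single é rather than on the real files: the two UTF-8 bytes
c3 a9 are each read as an ISO-8859-1 character (Ã and ©) and re-encoded,
giving four bytes.

```shell
# One é in UTF-8 is the two bytes c3 a9.
printf '\303\251' | od -An -tx1

# Reading those bytes as ISO-8859-1 and writing UTF-8 encodes each byte
# separately: c3 -> c3 83 ("Ã"), a9 -> c2 a9 ("©").
double=$(printf '\303\251' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$double"

# Running the conversion in the other direction undoes it.
single=$(printf '\303\203\302\251' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')
echo "$single"
```

A double-encoded é would therefore show up in a hex dump as c3 83 c2 a9,
not as c3 a9.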


I understand that, although I don't understand why that's happening.

head(1) and more(1) don't disguise that.


They certainly shouldn't, but:

   $ head -108 bad | tail -1 | cut -c1-68
   Veuillez ne pas répondre au présent courriel. Il a été généré

If you tell me this shouldn't be happening, I'll agree 100%.  But somehow
it is happening and I have no idea why.

Show the hex values of non-ASCII bytes.


I can't do that on the whole file, so I did this:

   $ cp -p good good_snippet
   $ cp -p bad bad_snippet
   $ vi good_snippet bad_snippet
        # delete all but the relevant part of line 108

   $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

...but nothing appeared to happen, and I killed the command after waiting
about a minute.  (...and yes, I tried that in a bash subshell because I know
that syntax won't work in tcsh).

My perl is a bit rusty, so I'm not sure exactly how this command works.
However, just to muddy the waters even further, I fell back on od:

   $ od -t x1c good_snippet 
   0000000  56  65  75  69  6c  6c  65  7a  20  6e  65  20  70  61  73  20
             V   e   u   i   l   l   e   z       n   e       p   a   s    
   0000020  72  c3  a9  70  6f  6e  64  72  65  20  61  75  20  70  72  c3
             r 303 251   p   o   n   d   r   e       a   u       p   r 303
   0000040  a9  73  65  6e  74  20  63  6f  75  72  72  69  65  6c  2e  20
           251   s   e   n   t       c   o   u   r   r   i   e   l   .    
   0000060  49  6c  20  61  20  c3  a9  74  c3  a9  20  67  c3  a9  6e  c3
             I   l       a     303 251   t 303 251       g 303 251   n 303
   0000100  a9  72  c3  a9  0a
           251   r 303 251  \n
   0000105

   $ od -t x1c bad_snippet 
   0000000  56  65  75  69  6c  6c  65  7a  20  6e  65  20  70  61  73  20
             V   e   u   i   l   l   e   z       n   e       p   a   s    
   0000020  72  c3  a9  70  6f  6e  64  72  65  20  61  75  20  70  72  c3
             r 303 251   p   o   n   d   r   e       a   u       p   r 303
   0000040  a9  73  65  6e  74  20  63  6f  75  72  72  69  65  6c  2e  20
           251   s   e   n   t       c   o   u   r   r   i   e   l   .    
   0000060  49  6c  20  61  20  c3  a9  74  c3  a9  20  67  c3  a9  6e  c3
             I   l       a     303 251   t 303 251       g 303 251   n 303
   0000100  a9  72  c3  a9  0a
           251   r 303 251  \n
   0000105

...and both snippets are identical!  Suddenly I understand even less than
I did when I started writing this reply. :-(
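
One way to take the editing step out of the picture would be to dump line
108 straight from the original files instead of from copies saved by vim,
since an editor may convert a file when reading or writing it.  A sketch,
using a small stand-in file (sample.txt and copy.txt are made-up names) in
place of good and bad:

```shell
# Build a three-line stand-in file; the second line contains "été" in UTF-8.
tmpdir=$(mktemp -d)
printf 'first\nIl a \303\251t\303\251\nlast\n' > "$tmpdir/sample.txt"

# Print only line 2 and dump its bytes; for the real files this would be
# something like: sed -n '108p' bad | od -t x1c
line_hex=$(sed -n '2p' "$tmpdir/sample.txt" | od -An -tx1 | tr -d ' \n')
echo "$line_hex"

# Comparing the same line from two files needs no editor either.
cp "$tmpdir/sample.txt" "$tmpdir/copy.txt"
sed -n '2p' "$tmpdir/sample.txt" > "$tmpdir/a"
sed -n '2p' "$tmpdir/copy.txt"   > "$tmpdir/b"
if cmp -s "$tmpdir/a" "$tmpdir/b"; then same=yes; else same=no; fi
echo "identical: $same"

rm -r "$tmpdir"
```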

Strangely, both snippet files look fine in vim.  But the original bad file
still looks bad in vim, and I'm at a loss for how to prove that except by
taking a screenshot, so I've done that and attached the result as a 34 KB
PDF file.

One additional fact, which must be relevant although I don't know enough
to say exactly how, is that the status bar in vim looks like this when
the good file is newly opened:

   "good" 836 lines, 50844 bytes                    1,1           Top

...but for the bad file, that becomes

   "bad" [converted] 336 lines, 49471 bytes         1,1           Top

The smaller number of lines is expected (that's the effect of my
no-longer-wanted patch to mhfixmsg), but does that also explain the
different number of bytes?

More importantly, vim explicitly claims that the bad file is "[converted]",
so maybe that's the source of the double encoding?
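
According to the vim documentation, "[converted]" means the file was
converted from its 'fileencoding' to vim's internal 'encoding' on reading,
which is what would happen if vim decided the file was not valid UTF-8 and
fell back to another encoding.  Whether that is what happened here is an
assumption, not something verified.  A sketch of a check that does not
involve vim: iconv converting UTF-8 to UTF-8 should fail on the first
invalid byte sequence.  The file names below are stand-ins, not the real
good and bad.

```shell
tmpdir=$(mktemp -d)

# valid.txt is pure UTF-8; mixed.txt also has lone e9 bytes, which are
# "é" in ISO-8859-1 but do not form valid UTF-8 sequences.
printf 'r\303\251pondre\n' > "$tmpdir/valid.txt"
printf 'r\303\251pondre\ng\351n\351r\351\n' > "$tmpdir/mixed.txt"

# iconv exits non-zero when the input is not valid in the -f encoding.
check_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        echo valid
    else
        echo invalid
    fi
}

valid_result=$(check_utf8 "$tmpdir/valid.txt")
mixed_result=$(check_utf8 "$tmpdir/mixed.txt")
echo "valid.txt: $valid_result"
echo "mixed.txt: $mixed_result"

rm -r "$tmpdir"
```

Run on the real file without discarding stderr, the same iconv command
would be expected to name the first offending position, if there is one.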

The more I try to think about this, the more my head hurts. :-(

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff      |
Montreal, QC, Canada | "Do not meddle in the affairs of dragons,
smw@smwonline.ca     |  for you are crunchy and good with ketchup."
http://smwonline.ca  |

bad.pdf
