I assume vim(1) will read up to a certain amount until it either makes up its mind or assumes the default.
That makes sense.
Try this to remove the boring ASCII bytes and see what's left.

    tr -d ' -~' <bad | env LC_ALL=C grep -n .
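For anyone following along without the original message, here is a minimal sketch of what that pipeline does, run on a made-up three-line sample rather than the real file. The function names `sample` and `leftovers` are hypothetical, and `LC_ALL=C` is also applied to tr here (an assumption, not part of the command above) so that both tools work strictly byte by byte.

```shell
# Hypothetical sample: a header-like line with a tab, a pure-ASCII line,
# and a line ending in UTF-8 "é" (bytes c3 a9).
sample() { printf 'To:\tme\nplain text\ncaf\303\251\n'; }

# Delete every byte from space through tilde (printable ASCII), then let
# grep number the lines that still contain at least one byte.
leftovers() { sample | LC_ALL=C tr -d ' -~' | LC_ALL=C grep -n . ; }

# Hex dump, so the surviving bytes are visible whatever the terminal encoding.
leftovers | od -An -tx1
```

Only lines 1 and 3 should survive: the tab (09) from the first line and the two bytes c3 a9 from the third. The all-ASCII second line becomes empty and grep drops it, which is why the line numbers in the output have gaps.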
Done. I've attached an 11 Kb PDF file to show the results, but I can describe them here as follows:

- The first 39 output lines show the tab characters from the message headers.

- Lines 94, 96, 98, 100, 102, 104, 108 and 110 all show accented characters, which appear out of context to be exactly what should appear in the message. This is absolutely consistent with the file being properly encoded in UTF-8.

- Lines 289, 291, 293, 295, 300, 304, 308 and 310 all show sequences of (nothing but) ‘�’ glyphs; in each case the number of these glyphs matches the number of valid characters in the lines 94-110 range.

- For reference, lines in the original file are divided as follows:

  - Lines 1-83 are the message headers

  - Lines 85-110 are the text/plain portion, with

        Content-Transfer-Encoding: 8bit
        Content-Type: text/plain; charset="UTF-8"
        Mime-Version: 1.0

  - Lines 112-336 are the text/html portion, with

        Content-Transfer-Encoding: 8bit
        Content-Type: text/html; charset=iso-8859-1
        Mime-Version: 1.0

...so it seems that tr is reporting exactly what we'd expect to see.
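One possible reading of the matching glyph counts, sketched below, is that each accented character in the iso-8859-1 part is a single byte, which is not valid UTF-8 on its own and so is displayed as one ‘�’ apiece. This is only an illustration on a single ‘é’, not a check of the actual message; it assumes iconv(1) is installed, and the helper name `hex` is made up for the example.

```shell
# Hex-dump stdin as one unbroken string of byte values.
hex() { od -An -tx1 | tr -d ' \n'; }

# "é" encoded as UTF-8 occupies two bytes.
utf8_hex=$(printf '\303\251' | hex)

# The same character converted to ISO-8859-1 is a single byte.
latin1_hex=$(printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | hex)

echo "UTF-8: $utf8_hex  ISO-8859-1: $latin1_hex"
```

If the HTML part really holds single bytes like this, a UTF-8 display would show one replacement glyph per accented character, which would fit the counts described above. The other possibility, that the part contains literal U+FFFD characters, would look the same on screen and needs a hex dump of the real file to rule out.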
https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character describes ‘�’ and it's being seen above because cut(1) is cutting bytes and the ‘108:’ at the start of the line has shifted the 68/69 cut-off point to part-way through the UTF-8 for a single code point AKA rune.
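The effect is easy to reproduce on a short made-up line. This is a sketch only: the sample text and the names `hex` and `line` are hypothetical, and the cut-off positions are chosen for the sample, not taken from the real message.

```shell
# Hex-dump stdin as one unbroken string of byte values.
hex() { od -An -tx1 | tr -d ' \n'; }

# Hypothetical line: "r", then "é" as UTF-8 (c3 a9), then "pondre".
line() { printf 'r\303\251pondre\n'; }

# Keeping bytes 1-2 cuts through the two-byte "é": only its lead byte c3 is kept.
split_hex=$(line | cut -b 1-2 | hex)

# Keeping bytes 1-3 ends on a character boundary, so the sequence stays whole.
whole_hex=$(line | cut -b 1-3 | hex)

echo "bytes 1-2: $split_hex  bytes 1-3: $whole_hex"
```

A lone c3 is an incomplete UTF-8 sequence, so a UTF-8 terminal has nothing valid to draw and shows ‘�’ instead. Prefixing the line with something like ‘108:’ moves every byte four positions to the right, which is enough to turn a clean cut into a split one.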
For me, this falls into the category of "things that are perfectly obvious, but only after they've been explained". Thank you for explaining it.
Try

    sh
    LC_ALL=C; export LC_ALL
    locale
    perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

Done, and I just learned something interesting.

First, the output looks like this:

    sh-5.1$ LC_ALL=C; export LC_ALL
    sh-5.1$ locale
    LANG=en_CA.UTF-8
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=C
    sh-5.1$ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
    Veuillez ne pas r<c3><a9>pondre au pr<c3><a9>sent courriel. Il a <c3><a9>t<c3><a9> g<c3><a9>n<c3><a9>r<c3><a9>

Second, the problem with the original command appearing to hang turns out to be an interaction between bash and xterm's pasting mechanism(!).

I'm accustomed to pasting a command line by triple-clicking to select the whole line, then middle-clicking to paste it. That's how xterm has worked since I first started using it <mumble> years ago.

...and it still works exactly this way, and the line gets pasted just as I expect, in tcsh.

...but in bash, although the line gets pasted, the newline at the end of it somehow doesn't. When

    LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

originally seemed to hang, in fact it was just waiting for me to press the Enter key!

I still don't know why this is happening, but at least I'm comforted by the fact that my bash binary isn't totally broken. :-/
Beware that invoking bash(1) as ‘sh’ is not the same as running ‘bash’.
I did know that, but thank you for mentioning it just in case.
Might not make a difference in this case, but in general it's better to run whichever is desired.
Right, but in this case sh was what was desired. As I understand it, when invoked that way bash behaves closer to a real Bourne shell than when invoked as bash.
I propose to forget this particular clupea harengus of the crimson variety unless you find it interesting in and of itself.

It is odd. And odd might affect other things, including to do with nmh. :-)
Odd indeed, but apparently only when used interactively with xterm, so nmh is unlikely to be affected.

- Steven

--
___________________________________________________________________________
Steven Winikoff     | Montreal, QC, Canada | "The reward of a thing well
smw@smwonline.ca    |  done is to have done it."
http://smwonline.ca |                      | - Emerson
tr_output.pdf