Re: [Nmh-workers] bug in decode_rfc2047()

> > I've been slow to adapt to the multibyte world, so I
> > tripped over a bug in decode_rfc2047():
> >
> >  12  12/31 "Sears"            2013 New Year?s Deals! Start the yearr right
> >
> > scan really did produce "yearr", wrongly.  valgrind noticed,
> > too.
>
> I'm trying to understand the bug ... what exactly triggered it?  The
> encoding on the Subject line was bad?  I'm just trying to understand
> potential pitfalls so future code doesn't have it.

The encoding was correct.  The problem was improper handling of
a character that is invalid for the locale.

U+2019 was used for the apostrophe in "Year's".  With my
single-byte locale, iconv failed at the first byte of that
character.  decode_rfc2047() output the '?', moved on to the
next character, and continued conversion.
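
To illustrate what iconv reports in that situation, here is a
minimal sketch.  It assumes ISO-8859-1 as the single-byte target
(the actual locale isn't stated above) and a short input;
first_unconvertible_offset() and NO_ERROR_OFFSET are hypothetical
names, not part of nmh.  Some iconv implementations substitute a
replacement character instead of failing, in which case this
simply reports no error.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical sentinel: no EILSEQ was reported. */
#define NO_ERROR_OFFSET ((size_t)-1)

/* Convert UTF-8 input to ISO-8859-1 (an assumed single-byte
 * target) and return the byte offset at which iconv() stopped
 * with EILSEQ, or NO_ERROR_OFFSET otherwise.  Sketch only: the
 * output buffer is fixed, so the input is assumed to be short. */
size_t first_unconvertible_offset(const char *in, size_t len)
{
    assert(in != NULL);

    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1)
        return NO_ERROR_OFFSET;

    char out[256];
    char *src = (char *)in;   /* iconv() wants char **, input is not modified */
    char *dst = out;
    size_t inleft = len;
    size_t outleft = sizeof out;
    size_t result = NO_ERROR_OFFSET;

    /* On EILSEQ, src is left pointing at the start of the
     * sequence that could not be converted. */
    if (iconv(cd, &src, &inleft, &dst, &outleft) == (size_t)-1
            && errno == EILSEQ)
        result = (size_t)(src - in);

    iconv_close(cd);
    return result;
}
```

With an iconv that rejects the character, the input
"Year\xE2\x80\x99s" stops at offset 4, the 0xE2 lead byte of
U+2019.  It is then up to the caller to step over all three
bytes before calling iconv again.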

decode_rfc2047() keeps track of its position in the input byte
string ("start") and the count of remaining bytes ("inbytes").
The problem was that, when skipping the invalid character, it
initially advanced start to the next byte but didn't decrement
inbytes.  That left inbytes one too high, so it eventually fed
iconv a byte of garbage.  (The input was split into two strings,
so the garbage showed up in the middle of scan's Subject.)
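
The mismatch can be shown in isolation.  This is a sketch, not
the nmh source: struct cursor, skip_fixed() and skip_buggy() are
hypothetical names, and skip_buggy() is a reconstruction of the
pre-fix behavior from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pairing of the two values that must stay in sync. */
struct cursor {
    const char *start;   /* next input byte */
    size_t inbytes;      /* bytes remaining from start up to q */
};

/* Skip one UTF-8 character, decrementing inbytes for every byte
 * stepped over, including the first. */
struct cursor skip_fixed(struct cursor c, const char *q)
{
    assert(c.inbytes > 0 && c.start < q);
    for (++c.start, --c.inbytes;
         c.start < q && ((unsigned char)*c.start & 0xC0) == 0x80;
         ++c.start, --c.inbytes)
        continue;
    return c;
}

/* Assumed pre-fix behavior: the initial advance of start is not
 * matched by a decrement of inbytes. */
struct cursor skip_buggy(struct cursor c, const char *q)
{
    assert(c.inbytes > 0 && c.start < q);
    for (++c.start;
         c.start < q && ((unsigned char)*c.start & 0xC0) == 0x80;
         ++c.start, --c.inbytes)
        continue;
    return c;
}
```

Starting at the 0xE2 byte of "\xE2\x80\x99s" with inbytes == 4,
both versions leave start on the 's', but skip_fixed() reports 1
byte remaining while skip_buggy() reports 2.  The extra byte lies
past the end of the input, which matches what valgrind flagged.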

The fix was to decrement inbytes when (initially) advancing
start.  It already did that for non-UTF-8 input.  So triggering
the bug took a combination of UTF-8 input, a multibyte
character, and a locale that couldn't handle that character.
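
To make the multibyte part concrete: U+2019 is encoded in UTF-8
as the three bytes E2 80 99, where 80 and 99 are continuation
bytes (bit pattern 10xxxxxx).  A small sketch of how many bytes
must be skipped per character follows; bytes_to_skip() and
is_continuation() are hypothetical helpers, not nmh functions.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Number of bytes occupied by the character starting at s,
 * never looking at or beyond end.  For non-UTF-8 input every
 * character is treated as a single byte.  Returns 0 if s is
 * already at the end. */
size_t bytes_to_skip(const char *s, const char *end, int fromutf8)
{
    assert(s != NULL && end != NULL);
    if (s >= end)
        return 0;
    if (!fromutf8)
        return 1;

    size_t n = 1;   /* the lead byte */
    while (s + n < end && is_continuation((unsigned char)s[n]))
        n++;
    return n;
}
```

For "\xE2\x80\x99s" this gives 3 when the input is UTF-8 and 1
otherwise.  Whatever count is used to advance start has to be
subtracted from inbytes as well.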

The root of all this is iconv's behavior that requires us to
skip past the invalid character.  Looking at it now, I wonder if
we can do better than the current special handling for UTF-8?
It's the "fromutf8" block below:

    while (inbytes) {
        if (iconv(cd, &start, &inbytes, &saveq, &savedstlen) ==
                (size_t)-1) {
            if (errno != EILSEQ) break;
            /* character couldn't be converted. we output a `?'
             * and try to carry on which won't work if
             * either encoding was stateful */
            iconv (cd, 0, 0, &saveq, &savedstlen);
            if (!savedstlen)
                break;
            *saveq++ = '?';
            savedstlen--;
            if (!savedstlen)
                break;
            /* skip to next input character */
            if (fromutf8) {
                for (++start, --inbytes;
                     start < q  &&  (*start & 192) == 128;
                     ++start, --inbytes)
                    continue;
            } else
                start++, inbytes--;
            if (start >= q)
                break;
        }
    }

That's the only special handling of UTF-8 in decode_rfc2047().
And decode_rfc2047() is our only caller of iconv(), and it's
just in this one place.

David
