Hello,
I believe I finally tracked and fixed the long-standing "server troubles
with SSL cause fetchmail to hang with 100% CPU" bug.
This mail is somewhat long, I'm sectioning it to make it more concise.
This mail has six sections:
1 non-technical problem description - for users
2,3,4 technical problem description - for hackers
5 workaround for users - for users
6 patch to fix the problems - for Eric and people who want to go
figure if this fixes their problems
1. PROBLEM:
The chat is POP3/SSL: USER, +OK, PASS, -ERR, QUIT -> hang
The server drops the connection right after -ERR, so there's no
chance the QUIT can get through.
fetchmail then hangs with 100% CPU. I recall that I had these hangs
in a different place when the server died earlier after accepting the
connection.
2. DIAGNOSIS: (you can safely skip to section 5 below unless you
understand fetchmail's internals, without missing important
information.)
strace shows that fetchnews loops on these syscalls:
getpid() = 2826
read(3, "", 5) = 0
read(3, "", 5) = 0
lsof says about fd #3:
fetchmail 2826 emma 3u sock 0,0 129440 can't identify protocol
At that time, the IPv4 connection has gone away.
3. REPRODUCTION: I have a test program (Perl) that allows me - when used
with stunnel - to reproduce the problem:
------------------------------------------------------------------------
#! /usr/bin/perl
# NAME THIS PROGRAM fakepop3.pl ####################################
$| = 1;
for(;;) {
print "+OK POP3 Fakedaemon ready.\r\n";
if (<> !~ /^user\s/i) { print "-ERR\r\n"; next; }
print "+OK\r\n";
if (<> !~ /^pass\s/i) { print "-ERR syntax\r\n"; next; }
print "-ERR Datenbankfehler\r\n"; close(STDOUT); last;
}
------------------------------------------------------------------------
Save this script as fakepop3.pl, create some fake /tmp/server.pem and
launch the server with (maybe stunnel can work without certificate,
didn't check):
/usr/sbin/stunnel -d 1995 -l fakepop3.pl -p /tmp/server.pem
Then add this to .fetchmailrc (you may need to comment out other "poll
localhost" sections)
poll localhost with proto pop3 uidl port 1995
user x password y is junkme here ssl
Then run fetchmail -Nvd0 localhost.
4. BUGS FOUND: I tracked this down to some bugs in socket.c.
#1 socket.c uses ERR_get_error rather than SSL_get_error, however,
SSL_read/SSL_peek are supposed to use SSL_get_error. Not sure if that
matters, I've fixed this anyhow.
#2 socket.c does not call SSL_get_error when SSL_peek or SSL_read
return 0, but it must call SSL_get_error in that case to distinguish
between "no data" and "error" conditions. (Talk about a strange API.)
5. WORKAROUND: For fetchmail users, a workaround is to install stunnel
and reconfigure .fetchmailrc to use something similar to this:
plugin "/usr/sbin/stunnel -c -r pop3.web.de:995"
Put this after "poll" and before "user". Remember to adjust your
server and port. Mind you, stunnel had some issues, be sure to get
the latest version (at least 3.22) from http://stunnel.mirt.net/
(official) or http://www.stunnel.org/ (trustworthy inofficial).
6. PATCH/FIX: Here's a patch against fetchmail 5.9.7 that fixes the
problem for me. Tested against OpenSSL 0.9.6something on SuSE Linux
7.3, not sure about older SSL versions. The patch is (C) 2002 by
Matthias Andree, and is put under a triple LGPL/GPL/MIT
license. Choose what suits your needs.
This patch also applies against fetchmail 5.9.0, at least when
applied with GNU patch, it will report offsets of consistently 33
lines. A patch against fetchmail 5.9.0 can be found at
http://studserver.uni-dortmund.de/~su0186/fetchmail-5.9.0.patch --
but it is untested, I only tested fetchmail 5.9.7.
Eric, please consider the patch for 5.9.8, and if you apply it,
please release another 5.9.x to get a wide test base before jumping
to 5.10.0 or 6.0.0, as I fear build woes with older OpenSSL/SSLEAY
versions.
--- ./socket.c.orig Sun Feb 10 13:54:53 2002
+++ ./socket.c Sun Feb 10 14:20:29 2002
@@ -563,13 +563,14 @@
later change the behavior of SSL_peek
to "fix" this problem... :-( */
if ((n = SSL_peek(ssl, bp, len)) < 0) {
+ (void)SSL_get_error(ssl, n);
return(-1);
}
if( 0 == n ) {
/* SSL_peek says no data... Does he mean no data
or did the connection blow up? If we got an error
then bail! */
- if( 0 != ( n = ERR_get_error() ) ) {
+ if( 0 != ( n = SSL_get_error(ssl, n) ) ) {
return -1;
}
/* We didn't get an error so read at least one
@@ -581,8 +582,13 @@
newline = NULL;
} else if ((newline = memchr(bp, '\n', n)) != NULL)
n = newline - bp + 1;
- if ((n = SSL_read(ssl, bp, n)) == -1) {
- return(-1);
+ /* Matthias Andree: SSL_read can return 0, in that case
+ * we must cal SSL_get_error to figure if there was
+ * an error or just a "no data" condition */
+ if ((n = SSL_read(ssl, bp, n)) <= 0) {
+ if ((n = SSL_get_error(ssl, n))) {
+ return(-1);
+ }
}
/* Check for case where our single character turned out to
* be a newline... (It wasn't going to get caught by
@@ -635,16 +641,20 @@
#ifdef SSL_ENABLE
if( NULL != ( ssl = SSLGetContext( sock ) ) ) {
n = SSL_peek(ssl, &ch, 1);
+ if (n < 0) {
+ (void)SSL_get_error(ssl, n);
+ return -1;
+ }
if( 0 == n ) {
/* This code really needs to implement a "hold back"
* to simulate a functioning SSL_peek()... sigh...
* Has to be coordinated with the read code above.
* Next on the list todo... */
- /* SSL_peek says no data... Does he mean no data
+ /* SSL_peek says 0... Does that mean no data
or did the connection blow up? If we got an error
then bail! */
- if( 0 != ( n = ERR_get_error() ) ) {
+ if( 0 != ( n = SSL_get_error(ssl, n) ) ) {
return -1;
}
--
Matthias Andree
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin