RabbitMQ and HAProxy: a timeout issue

If you’re trying to set up a highly available RabbitMQ cluster behind HAProxy, you may run into a disconnection issue with your clients.

This problem comes from HAProxy’s timeout client parameter (the older clitimeout is deprecated), which caps how long a client connection may stay inactive. If a connection is considered idle for more than timeout client (a value in milliseconds when no unit is given), HAProxy drops it.

RabbitMQ clients use persistent connections to a broker, which never time out. See the problem here? If your RabbitMQ client stays inactive for too long, HAProxy will close the connection.

So how do we solve the problem? HAProxy has a clitcpka option which enables the sending of TCP keepalive packets on the client side.

Let’s use it!

But that doesn’t solve the problem; the disconnections are still there. Damn.

In a discussion about RabbitMQ and HAProxy on the RabbitMQ mailing list, Tim Watson pointed out that:

[…]the exact behaviour of tcp keep-alive is determined by the underlying OS/Kernel configuration[…]

On Ubuntu 14.04, the tcp(7) man page shows that the default value of the tcp_keepalive_time parameter is 2 hours. This parameter defines how long a connection must stay idle before TCP begins sending out keep-alive probes.

You can also verify it by using the following command:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
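For reference, two related kernel parameters control what happens once that idle period has elapsed: tcp_keepalive_intvl is the number of seconds between probes, and tcp_keepalive_probes is how many unanswered probes are sent before the connection is dropped. The values below are the usual Linux defaults; check your own host:

$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9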

OK! Let’s raise the timeout client value in our HAProxy configuration for AMQP; 3 hours should be enough. And that’s it! No more disconnection issues 🙂

Here is a sample HAProxy configuration:

global
        log 127.0.0.1   local1
        maxconn 4096
        #chroot /usr/share/haproxy
        user haproxy
        group haproxy
        daemon
        #debug
        #quiet

defaults
        log     global
        mode    tcp
        option  tcplog
        retries 3
        option redispatch
        maxconn 2000
        timeout connect 5000
        timeout client 50000
        timeout server 50000

listen  stats :1936
        mode http
        stats enable
        stats hide-version
        stats realm Haproxy\ Statistics
        stats uri /

listen aqmp_front :5672
        mode            tcp
        balance         roundrobin
        # 3h exceeds the kernel's default 2h tcp_keepalive_time, so keepalive
        # probes go out before HAProxy considers the connection idle
        timeout client  3h
        timeout server  3h
        # send TCP keepalive packets on the client side
        option          clitcpka
        server          aqmp-1 rabbitmq1.domain:5672  check inter 5s rise 2 fall 3
        server          aqmp-2 rabbitmq2.domain:5672  check inter 5s rise 2 fall 3

Enjoy your highly available RabbitMQ cluster!

I think there may be another solution to this problem using RabbitMQ’s heartbeat feature; read more about it here: https://www.rabbitmq.com/reliability.html
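If you go down that road, here is a minimal rabbitmq.config sketch (classic Erlang-term format; the 60-second value is only an illustration) that lowers the heartbeat interval the broker suggests to clients, so that some traffic flows on the connection long before any proxy timeout expires:

[
  {rabbit, [
    {heartbeat, 60}
  ]}
].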


27 thoughts on “RabbitMQ and HAProxy: a timeout issue”

      1. I found a problem. Clients don’t use keepalive connections. Each connection sends a TCP SYN/ACK after AMQP Connection.Close, RabbitMQ responds with an RST/ACK, and HAProxy doesn’t like that response.

  1. clitimeout and srvtimeout are deprecated now. From now on it’s “timeout client” and “timeout server”.

    Thanks for your post!

  2. I wanted to mention that you can also set “timeout client/server” directly in the “listen” section. This way the high timeouts only affect the AMQP port and not the other, normal HTTP load balancers.

    We have about 40 load-balanced HTTP services in HAProxy and will now add AMQP as another one, but we only want the high timeouts set there, so we add timeout client and timeout server directly in the listen section, which is good. The documentation says that is allowed!

    1. Indeed, my example was only showing an HAProxy configuration dedicated to AMQP balancing.

      I’ll update my article because I think your solution is more specific, thanks mate!

      Note: AMQP uses TCP load-balancing, not HTTP.

  3. On the statistics page the Error Resp column gets filled (the tooltip shows “Connection resets during transfers: 2 client, 15 server”). The number seems to increase after every client disconnect. Have you experienced this? Any idea why these errors are logged?

    1. I’ve been inspecting the stats page since your comment and I’m also seeing a lot of Resp errors. I don’t have time to investigate it at the moment but will surely do it later.

      If you find something on your side, please tell me 😉

      1. I also face the same issue. The number increases after every client disconnect and I’m seeing a lot of Resp errors. Did any of you find a solution?

  4. I fail to understand how increasing the value of tcp_keepalive_time solves the issue here. From what I understand, tcp_keepalive_time defines when the Linux kernel sends a duplicate ACK packet after a socket has been idle for some time. How does increasing that value fix the problem?

    1. Hello,

      The thing is, RabbitMQ does not open a connection per message; it reuses the same connection to send multiple messages. So if that connection stays up for more than timeout client (50 seconds in the defaults section above) without any activity, HAProxy considers it idle and drops it.

      From the tcp(7) man page, tcp_keepalive_time is:

      The number of seconds a connection needs to be idle before TCP begins sending out keep-alive probes.

      So if you don’t want your connection to be considered idle by HAProxy, you can either increase the timeout or decrease the tcp_keepalive_time value (not recommended, as it could impact other processes using TCP).
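      Just to illustrate the second option (again, not recommended, since it affects every TCP connection on the host):

      $ sudo sysctl -w net.ipv4.tcp_keepalive_time=40
      net.ipv4.tcp_keepalive_time = 40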

  5. Helps a lot if you ever see this from your PHP consumer:

    Uncaught exception ‘PhpAmqpLib\Exception\AMQPIOException’ with message ‘Error reading data. Received 0 instead of expected 7 bytes’ in …/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Wire/IO/StreamIO.php:161

  6. Hi, as far as I know TCP keepalive packets are not “visible” at the application level, so how can this work in conjunction with the client timeout parameter? HAProxy doesn’t receive anything when a keepalive probe is sent or received.

  7. Thank you for reminding us of the possibility of using RabbitMQ’s heartbeat feature; I tried it. According to the RabbitMQ documentation, RabbitMQ 3.0 will by itself send a heartbeat to the client every 580 seconds. The heartbeat packets make HAProxy consider the connection still alive, so it does not close it. So we may not need “option clitcpka”: just set timeout client and timeout server higher than 580 seconds and the problem is solved.

    https://www.rabbitmq.com/heartbeats.html
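    A sketch of what that could look like (the values are illustrative; anything comfortably above the 580-second heartbeat should do):

    listen aqmp_front :5672
            mode            tcp
            timeout client  15m
            timeout server  15m
            server          aqmp-1 rabbitmq1.domain:5672 check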

  8. I found this page extremely helpful in setting up a local HAProxy rig in front of our RabbitMQs, but when it went into heavy dev testing I found that HAProxy reported every single session as ended prematurely (connection resets trailed sessions only by the current number of open sessions). This caused problems for the upstream application as well. I read all the comments here and had no success with heartbeats or other tweaks, but then I found this page:

    http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-January/017138.html

    Basically, the default TCP listener configuration for RabbitMQ doesn’t enable TCP keepalives. Enabling that last option (which requires defining the entire TCP listener configuration) will stop RabbitMQ from ending each session with a reset (RST) packet, and keep HAProxy from logging all sessions as “connection reset by server”.
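    For anyone looking for it, that setting lives in rabbitmq.config and looks roughly like this (the first three options are the values RabbitMQ has historically used by default; {keepalive, true} is the line being added):

    [
      {rabbit, [
        {tcp_listen_options, [
          {backlog,       128},
          {nodelay,       true},
          {exit_on_close, false},
          {keepalive,     true}
        ]}
      ]}
    ].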

      1. When deploying with listen blocks, use ‘option tcpka’. Anyone using frontend/backend style configuration should use clitcpka and srvtcpka instead.
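        Roughly, the two styles look like this (placeholder names; you would use one style or the other):

        listen aqmp_front :5672
                mode   tcp
                option tcpka
                server aqmp-1 rabbitmq1.domain:5672 check

        frontend aqmp_in
                bind   :5672
                mode   tcp
                option clitcpka
                default_backend aqmp_nodes

        backend aqmp_nodes
                mode   tcp
                option srvtcpka
                server aqmp-1 rabbitmq1.domain:5672 check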

  9. Hi, thanks for this post. I have an all-Docker system. I brought up a RabbitMQ cluster with two nodes, rabbit1 and rabbit2, on the same host. My RabbitMQ containers connect via a username and password in the AMQP URI. The problem is that HAProxy gets a Layer 4 “Connection refused” error as it attempts to connect. I am guessing that since HAProxy operates at the TCP level it has no way of providing credentials, as it is not speaking the AMQP protocol, hence the brokers refuse the connection.

    root@ubuntu:~/proj/haproxy# docker run -it --rm --name haproxy1 --link proj_rabbit1_1:rabbit1 --link proj_rabbit2_1:rabbit2 haproxy:latest
    [WARNING] 053/223835 (1) : Server aqmp_front/rabbit1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    [WARNING] 053/223837 (1) : Server aqmp_front/rabbit2 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    [ALERT] 053/223837 (1) : proxy 'aqmp_front' has no server available!
