Weird network behavior

Recently, we were asked to debug connectivity problems with an internal service, where about 1% of requests were resulting in "Connection reset by peer" errors.  There were no reported errors from the server itself, so we decided to take a closer look at the traffic on the wire.

First, we wanted to verify the ratio of resets occurring on the client side, so I ngrepped for all the TCP flags on the service port.
ngrep -d any port $SERVER_PORT | grep "\[" | grep -e "^T" | tr -s ' ' | cut -d ' ' -f5 | head -100000 | sort | uniq -c
  42188 [A]
  14041 [AF]
  29212 [AP]
    261 [AR]
  14298 [AS]

Indeed, the number of resets on the client is higher than we'd expect.  The same thing on the server side.

ngrep port $SERVER_PORT | grep "\[" | grep -e "^T" | tr -s ' ' | cut -d ' ' -f5 | head -100000 | sort | uniq -c
  37912 [A]
  15269 [AF]
  31238 [AP]
     76 [AR]
     34 [R]
  15471 [S]

The weird thing here in this sample was both sides report most of the resets are being issued from the other side of the connection.



Communication between the client and server is mediated by an F5 proxy, configured to use a snat table... so the F5 must be making the decision to reset the connections.   A coworker verified that loading a server directly didn't result in connection resets, but going through the F5 does.  I wanted to look at some example TCP conversations on the server to see what happens on the connections when a there is a reset.   I didn't really want to manually identify an example weeding through the traffic, so I dumped a few minutes of ngrep results into a file, and then ran the following to pick out the sequence of flags in the TCP conversation...  This just grep for lines containing a reset, and then picks out the snat ip and port, then searches the file for the TCP conversations containing that tuple.

grep "\[AR\]" $TRAFFIC_FILE | cut -d ' ' -f4 | xargs -I% sh -c 'grep "%" traffic; echo ""' | head -100
T 2017/04/13 01:22:11.436659 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [S]
T 2017/04/13 01:22:11.437033 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [A]
T 2017/04/13 01:22:11.437057 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [AP]
T 2017/04/13 01:22:11.440700 $SERVER_IP:$SERVER_PORT -> $LB_IP:$LP_PORT [AP]
T 2017/04/13 01:22:11.440968 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [A]
T 2017/04/13 01:22:11.441646 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [AF]
T 2017/04/13 01:22:17.953067 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [S]
T 2017/04/13 01:22:18.952770 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [S]
T 2017/04/13 01:22:19.952625 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [S]
T 2017/04/13 01:22:20.952472 $LB_IP:$LP_PORT -> $SERVER_IP:$SERVER_PORT [AR]

This seems weird to me, we see a conversation finish, followed by three SYNs from the LB, then an ACKRST... the timestamps appear to be in one second increments, so it looks like the resets are a result of a timeout on the F5.  It looks like the F5 has picked an IP and ephemeral port from a connection that has been recently closed, and is probably in a TIME_WAIT state.  We verified this to be the case shortly after seeing these kind of connection resets with netstat.  The F5 wasn't responsible for the original connection close, so presumably that IP:PORT pair is immediately available for reuse... and the F5 looks like it chooses random IP:PORT pairs when establishing a connection, rather than some kind of round robin scheme.  The server should actually have been configured to reuse connections in TIME_WAIT

sudo sysctl -a | grep tcp_tw_
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 1


This actually presumes that the sequence numbers between connection attempts are significantly further apart than the TCP window... but we aren't seeing this information with ngrep.  So repeated this type of capture with tcpdump.

  1. 17:15:00.320997 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [S], seq 1640887617, win 14600, options [mss 1460], length 0
  2. 17:15:00.321014 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [S.], seq 2030898605, ack 1640887618, win 14600, options [mss 1460], length 0
  3. 17:15:00.321464 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [.], ack 1, win 14600, length 0
  4. 17:15:00.321485 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [P.], seq 1:321, ack 1, win 14600, length 320
  5. 17:15:00.321494 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [.], ack 321, win 15544, length 0
  6. 17:15:00.325898 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [P.], seq 1:271, ack 321, win 15544, length 270
  7. 17:15:00.325969 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [F.], seq 271, ack 321, win 15544, length 0
  8. 17:15:00.326334 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [.], ack 271, win 15544, length 0
  9. 17:15:00.328226 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [F.], seq 321, ack 272, win 15544, length 0
  10. 17:15:00.328266 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [.], ack 322, win 15544, length 0
  11. 17:15:55.915648 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [S], seq 1563951501, win 64240, options [mss 1460], length 0
  12. 17:15:55.915670 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [.], ack 322, win 15544, length 0
  13. 17:15:56.915125 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [S], seq 1563951501, win 64240, options [mss 1460], length 0
  14. 17:15:56.915172 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [.], ack 322, win 15544, length 0
  15. 17:15:57.914873 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [S], seq 1563951501, win 64240, options [mss 1460], length 0
  16. 17:15:57.914881 IP $SERVER_IP:$SERVER_PORT > $LB_IP:$LP_PORT: Flags [.], ack 322, win 15544, length 0
  17. 17:15:58.916339 IP $LB_IP:$LP_PORT > $SERVER_IP:$SERVER_PORT: Flags [R.], seq 4218031181, ack 782496990, win 0, length 0

So we see a lot more with tcpdump than with ngrep.  There are clearly two conversations that happen on this connection, and the F5 looks like it does the right thing, bumping up the sequence number well beyond whatever the tcp window might be.  But our server is sending acks with the older sequence numbers when the load balancer sends the new SYNs.  So somehow, it looks as if the tw_reuse settings aren't working.  One other note of interest here is the F5 appears to strip timestamps out of the TCP packets... if these settings rely on that, it could explain why the connection cannot be reused.  I seem to remember when perusing the documentation, only the tw_recycle would need timestamps, and that is disabled.  So why connections in TIME_WAIT can't be used remains a mystery.

The other interesting thing, looking back at the ratio of TCP flags posted at the beginning, we see nearly as many FINACKs as ACKPUSH... which when you look at the rest of the conversation, it becomes clear that we aren't reusing the connections at all between requests.  These are supposed to be using persistent connections, so why does the server close the connection at all?

We looked back through the splunk logs to see when we started seeing higher connection errors, and looked at the code... nothing stuck out as to why we were no longer using persistent connections.  One of our devs discovered that when we upgraded our Jenkins servers, we migrated the build process from Vagrant to Docker, where Vagrant explicitly specified use of Go version 1.4, whereas the Docker process specifies only the minor version, which was 1.4.3.  In this version, when you close a request body, it caused the connection to close.  Another dev found the code causing the problem, found here.

We updated our version of Go to 1.8, and with persistent connections, it lowered our error rate an order of magnitude.  We still haven't solved the underlying reason for our server not being able to properly reuse a connection in TIME_WAIT, if anyone has any ideas, I would appreciate it.

Comments

Popular Posts