A strange TIME_WAIT: analyzing frequent RSTs on HTTP client connections

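A couple of days ago aiming mentioned that a server making internal calls to a PHP interface frequently received RSTs. From the start I suspected TIME_WAIT or something like it, but I kept assuming the TIME_WAIT was on the client side and ran into a dead end. Just now it occurred to me that when a client requests an HTTP interface on a server, the side that actively closes the connection is presumably the server, so it is the server that enters TIME_WAIT. Following that line of thought and running an experiment, I finally found the cause.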

Tools used in the experiment: tcpdump, curl, hping (you need some familiarity with them to follow the rest)
Client in the experiment: sd-pt-cs00.bj (10.180.2.251)
Server in the experiment: sd-pt-xman00.bj (10.180.2.243)

First we simulate the client making an HTTP request to the server
[root@client]# curl "sd-pt-xman00.bj" # returns some content

The packets captured on the server at the same time:
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
00:04:45.820164 IP (tos 0x0, ttl 64, id 10936, offset 0, flags [DF], proto TCP (6), length 60)
    sd-pt-cs00.bj.57979 > sd-pt-xman00.bj.http: Flags [S], cksum 0xc4ea (correct), seq 2862922835, win 14600, options [mss 1460,sackOK,TS val 1309251307 ecr 0,nop,wscale 6], length 0
00:04:45.820217 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    sd-pt-xman00.bj.http > sd-pt-cs00.bj.57979: Flags [S.], cksum 0xd192 (correct), seq 3280914721, ack 2862922836, win 14480, options [mss 1460,sackOK,TS val 4054807901 ecr 1309251307,nop,wscale 7], length 0
00:04:45.820217 IP (tos 0x0, ttl 64, id 10937, offset 0, flags [DF], proto TCP (6), length 52)
# several lines omitted here
00:04:45.821854 IP (tos 0x0, ttl 64, id 10942, offset 0, flags [DF], proto TCP (6), length 52)
    sd-pt-cs00.bj.57979 > sd-pt-xman00.bj.http: Flags [F.], cksum 0x2ad1 (correct), seq 168, ack 3080, win 364, options [nop,nop,TS val 1309251309 ecr 4054807902], length 0
00:04:45.822037 IP (tos 0x0, ttl 64, id 48260, offset 0, flags [DF], proto TCP (6), length 52)
    sd-pt-xman00.bj.http > sd-pt-cs00.bj.57979: Flags [.], cksum 0x2bc2 (correct), seq 3080, ack 169, win 122, options [nop,nop,TS val 4054807903 ecr 1309251309], length 0

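From this we can work out the absolute ack value in the server's final ACK of the client's FIN: tcpdump prints the relative ack 169, and adding the client's initial seq gives 169 + 2862922835 = 2862923004.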
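The same arithmetic as a minimal Python sketch. The function name `absolute_seq` is made up for illustration; the modulo reflects the fact that TCP sequence numbers are 32-bit and wrap around, which does not matter for the numbers in this capture.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def absolute_seq(isn: int, relative: int) -> int:
    """Convert a relative seq/ack printed by tcpdump back to its absolute value."""
    return (isn + relative) % SEQ_SPACE


client_isn = 2862922835  # seq of the client's SYN in the capture
relative_ack = 169       # ack in the server's final ACK of the client's FIN

final_ack = absolute_seq(client_isn, relative_ack)
print(final_ack)
```

This is the value to look for later when the server answers a packet on the same four-tuple.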

Next, look at the TIME_WAIT entry for this connection on the server

# netstat -ant | grep 80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.180.2.243:22 10.180.2.251:38216 ESTABLISHED
tcp 0 0 10.180.2.243:80 10.180.2.251:57979 TIME_WAIT

So this connection was 10.180.2.251:57979 -> 10.180.2.243:80
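When there are many connections, picking the TIME_WAIT entries out by eye gets tedious. Below is a small Python sketch that filters them from `netstat -ant` style text. The helper name `time_wait_pairs` is hypothetical, and the sketch assumes the six-column layout shown above (proto, recv-q, send-q, local, remote, state).

```python
from typing import List, Tuple

# Sample text taken from the netstat output above.
NETSTAT_OUTPUT = """\
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.180.2.243:22 10.180.2.251:38216 ESTABLISHED
tcp 0 0 10.180.2.243:80 10.180.2.251:57979 TIME_WAIT
"""


def time_wait_pairs(netstat_text: str) -> List[Tuple[str, str]]:
    """Return (local, remote) address pairs for connections in TIME_WAIT."""
    pairs = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state
        if len(fields) == 6 and fields[5] == "TIME_WAIT":
            pairs.append((fields[3], fields[4]))
    return pairs


for local, remote in time_wait_pairs(NETSTAT_OUTPUT):
    print(f"{remote} -> {local}")
```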

Then, on the client, send a packet that reuses exactly the same source port and destination as the previous curl request
# hping3 10.180.2.243 -s 57979 -p 80
HPING 10.180.2.243 (eth0 10.180.2.243): NO FLAGS are set, 40 headers + 0 data bytes
len=52 ip=10.180.2.243 ttl=64 DF id=0 sport=80 flags=A seq=0 win=122 rtt=0.1 ms

Capture on the server at the same time
# tcpdump -i eth0 -p tcp port 80 -vvv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
00:05:07.365304 IP (tos 0x0, ttl 64, id 33134, offset 0, flags [none], proto TCP (6), length 40)
    sd-pt-cs00.bj.57979 > sd-pt-xman00.bj.http: Flags [none], cksum 0xd5fc (correct), seq 1790161117, win 512, length 0
00:05:07.365334 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    sd-pt-xman00.bj.http > sd-pt-cs00.bj.57979: Flags [.], cksum 0xd79a (correct), seq 3280917801, ack 2862923004, win 122, options [nop,nop,TS val 4054829446 ecr 1309251309], length 0
00:05:07.365334 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    sd-pt-cs00.bj.57979 > sd-pt-xman00.bj.http: Flags [R], cksum 0x4a1e (correct), seq 2862923004, win 0, length 0

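At this point the problem aiming reported is reproduced: the client starts a connection, but the exchange ends with a packet flagged R (RST) before the three-way handshake ever completes.
Look at the ack the server returned, 2862923004. It does not match the seq of the packet the client just sent (1790161117), but isn't it exactly the last ack the server sent to the client when the previous curl connection was closed? Ha, the picture is getting clearer.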

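When the client's request rate (QPS) to the server is high, a previous connection may still be in TIME_WAIT on the server when the client reuses the same socket (same source IP and port) for a new request. In that case the server replies with the stale ack shown above, and the client resets the connection with a RST.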

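For example:
The client keeps sending similar short-lived HTTP requests to the same server. If the client has 50000 local ports and the server's TIME_WAIT lasts 120 s, then in theory, once QPS exceeds 50000/120 ≈ 417, connection RSTs like this will occur frequently.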
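The estimate can be written out as a short Python sketch. The function name is hypothetical, and the model is deliberately crude: it assumes every request opens a new connection and that a port only becomes safe to reuse once the server's TIME_WAIT for it has expired.

```python
def max_qps_without_reuse(local_ports: int, time_wait_seconds: float) -> float:
    """Highest request rate at which no source port has to be reused
    while the server still holds its previous connection in TIME_WAIT."""
    return local_ports / time_wait_seconds


threshold = max_qps_without_reuse(local_ports=50000, time_wait_seconds=120)
print(round(threshold))  # requests per second
```

Above this rate the client must, on average, come back to a port before 120 s have passed.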

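Solutions:
1. Increase the number of local ports on the client. This probably won't help much: the local port range is already large, and it can never exceed 65535.
2. Shorten the server's TIME_WAIT (2MSL) duration. The stock kernel exposes no interface (sysctl) for changing the TIME_WAIT duration, but fortunately I had yubo open up this interface in our custom kernel yesterday. Tomorrow I will update the kernel on aiming's server and set TIME_WAIT to a very small value (for example 5 s) to try to fix the problem.
3. Bind several IPs on the client and use different source IPs when accessing the server.
4. Give the server several IPs and have the client spread requests across them.
5. Shrink tw_bucket on the server.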
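To compare these options roughly, here is a Python sketch that counts the distinct (client IP, client port, server IP) combinations available per TIME_WAIT period. The function name, the 64000-port range, and the figure of four client IPs are made-up illustrations; 50000 ports, 120 s and 5 s come from the text. It assumes a single server port and evenly spread traffic, and it does not model the tw_bucket option.

```python
def reuse_threshold_qps(local_ports: int, time_wait_seconds: float,
                        client_ips: int = 1, server_ips: int = 1) -> float:
    """Request rate above which some four-tuple must be reused
    while the server may still hold it in TIME_WAIT."""
    four_tuples = local_ports * client_ips * server_ips
    return four_tuples / time_wait_seconds


baseline = reuse_threshold_qps(50000, 120)
more_ports = reuse_threshold_qps(64000, 120)               # option 1 (hypothetical range)
short_time_wait = reuse_threshold_qps(50000, 5)            # option 2 (5 s from the text)
four_client_ips = reuse_threshold_qps(50000, 120, client_ips=4)  # option 3 (hypothetical)

for name, qps in [("baseline", baseline), ("more ports", more_ports),
                  ("5 s TIME_WAIT", short_time_wait), ("4 client IPs", four_client_ips)]:
    print(f"{name}: {qps:.0f} QPS")
```

Under these assumptions, widening the port range barely moves the threshold, while shortening TIME_WAIT or adding IPs scales it linearly.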

————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
A method suggested by a commenter. It looks great and I will try it when I get the chance.
3.3.20. tcp_rfc1337

The tcp_rfc1337 variable implements the solution found in RFC 1337 – TIME-WAIT Assassination Hazards in TCP to TIME-WAIT Assassination. In short, the problem is that old duplicate packets may interfere with new connections, and lead to three different problems. The first one is that old duplicate data may be accepted erroneously in new connections, leading to the sent data becoming corrupt. The second problem is that connections may become desynchronized and get into an ACK loop because of old duplicate packets entering new connections, which will become desynchronized. The third and last problem is that old duplicate packets may enter newly established connections erroneously and kill the new connection.

There are three possible solutions to this according to the mentioned RFC, however, one solution is only partial and not a long term solution, while the last requires heavy modifications of the TCP protocol, and is hence not a viable option.

The final solution that the linux kernel implements with this option, is to simply ignore RST packets sent to a socket while it is in the TIME-WAIT state. In use together with 2 minute Maximum Segment Life (MSL), this should eliminate all three problems discussed in RFC 1337.

#sysctl -a|grep rfc
net.ipv4.tcp_rfc1337 = 0
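The same value can also be read from a script. On Linux, sysctl names map to files under /proc/sys with the dots replaced by slashes; the sketch below relies on that mapping. The helper names are hypothetical, and `read_sysctl` returns None where the file is absent (for example on a non-Linux host).

```python
from pathlib import Path
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value as text, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


print(sysctl_path("net.ipv4.tcp_rfc1337"))
print(read_sysctl("net.ipv4.tcp_rfc1337"))
```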

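I set net.ipv4.tcp_rfc1337=1 and found that the only effect is that the server (the TIME_WAIT side) does not delete the TIME_WAIT entry when a packet for the same socket arrives; the entry is kept for the full 2MSL.
With tcp_rfc1337=0, when the server receives a SYN for the same socket it replies with the ACK of the FIN from the previous connection held in TIME_WAIT, and deletes that TIME_WAIT session.

So this still does not seem to meet our needs. What we need is for a request on the same socket not to conflict with the same socket still in TIME_WAIT; shortening 2MSL still looks like the most effective option.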

12 Comments

wilbur

2013/07/30 at 3:33 PM

Nice!

jaseywang

2013/07/31 at 7:48 PM

Bind a few more IPs and you get more ports, right?

- siyu
  
  2013/08/01 at 11:34 AM
  
  Binding a few more IPs on the client solves it.
  Alternatively, shrinking tw_bucket on the server also solves it.
  
netdc

2013/08/01 at 3:23 PM

Why doesn't the site have an RSS feed?

- laiwei
  
  2013/08/01 at 3:25 PM
  
  http://noops.me/?feed=rss2
  
  It does :)
  Try subscribing to the link above.
  
laneovcc

2013/08/29 at 4:54 PM

tcp_rfc1337

http://www.frozentux.net/ipsysctl-tutorial/ipsysctl-tutorial.html#AEN448

Facepalm.

- siyu
  
  2013/08/29 at 7:44 PM
  
  Nice, learned something.
  
- siyu
  
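  2013/08/30 at 4:53 PM
  
  I set net.ipv4.tcp_rfc1337=1 and found that the only effect is that the server (the TIME_WAIT side) does not delete the TIME_WAIT entry when a packet for the same socket arrives; the entry is kept for the full 2MSL.
  With tcp_rfc1337=0, when the server receives a SYN for the same socket it replies with the ACK of the FIN from the previous connection held in TIME_WAIT, and deletes that TIME_WAIT session.
  
  So this still does not seem to meet our needs. What we need is for a request on the same socket not to conflict with the same socket still in TIME_WAIT; shortening 2MSL still looks like the most effective option.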
  
  - laneovcc
    
    2013/08/30 at 11:36 PM
    
    Indeed. The only option is to build a custom kernel, which is a real pain.
    
ZJ

2013/09/01 at 4:20 PM

Is this what you are looking for? Have you turned off tw recycling?
http://www.pagefault.info/?p=416

- siyu
  
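  2013/09/02 at 4:53 PM
  
  That seems to be a different situation from mine; I have looked at similar issues before.
  
  In my case the server does not actually return a RST; it returns the last ACK of the session in TIME_WAIT.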
  
刘晟

2013/10/17 at 1:58 AM

Does this service use short-lived or long-lived connections? Was fast TIME_WAIT recycling enabled during the test?