I have been haunted by a weird TCP spurious retransmissions and TCP DUP ACK issue for the past month – I first noticed it in the last week of November. Our production FTP server is a Red Lion device (see here) sitting in our manufacturing site, whereas our source servers are hosted on Hyper-V clusters. This setup has no firewalls; only Cisco Nexus switches – three 3064s and two 3048s connected in an HA design. Our Hyper-V clusters connect to the Cisco 3064 switches in an HA fashion, with two NIC cables run from each VM host to two 3064 switches. The Red Lion FTP/HTTP device is attached to a 3048, and the 3064s connect to the 3048s directly – no firewalls in between.
STP is configured properly and running fine. With every device other than the Red Lion, traffic routes as desired and data transfer rates reach 250 MB/s. Yet if this same Red Lion device is moved to a different network that uses Cisco Catalyst switches, it works fine – no retransmission issues at all.
There are a lot of packet retransmissions happening just before the FTP application fails with an error – by the way, I am using the FileZilla client to transfer data to the FTP box. The same happens when browsing the FTP/HTTP site hosted on the Red Lion box via IE from my machines.
Wireshark Analysis
I’ve analysed the network connection between the servers in question and noticed a lot of packet retransmissions. TCP RSTs (resets) and spurious retransmissions (the source retransmits a packet even though the destination has already ACKed it, behaving as if the ACK never arrived) show up in high numbers. This is not the case when I capture traffic between the other sources.
A TCP RST by itself normally isn’t an issue, because one appears after every session closure. In our case, however, the packet retransmissions and failing communication are resetting the RPC port communication, which is why these messages are seen. So naturally we will see this kind of message in both the success and failure cases.
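For anyone who wants to reproduce this kind of analysis, a display filter along these lines isolates the suspect traffic; the same filter works in the Wireshark GUI or from the command line with tshark (the capture file name is just a placeholder):

    tshark -r ftp_capture.pcapng -Y "tcp.analysis.spurious_retransmission || tcp.analysis.duplicate_ack || tcp.flags.reset == 1"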
TCP Segment Length
I have noticed that the Maximum Segment Size (MSS) of the destination server – the Red Lion box – is 1280, while the source server’s is 1460. Pinging the destination (the 1280-MSS device) with 1460 bytes and fragmentation disallowed works fine, and the data the remote server responds with has the same length; a “data.len>1460” filter shows that ICMP data of 1460 bytes is transmittable both ways. Both the source and destination servers agreed to communicate using the 1280 MSS value, as they should per the protocol standard; I verified this with a “tcp.len>1200” filter and could see that no TCP segment in the application traffic uses a size larger than 1280, which eliminates MSS size as the cause of the packet retransmissions.
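For reference, the ping test and the filters look roughly like this (the destination IP is a placeholder; on Windows, -f sets the Don’t Fragment bit and -l sets the ICMP payload size):

    ping 192.0.2.50 -f -l 1460          (1460-byte ICMP payload, fragmentation not allowed)
    data.len > 1460                     (Wireshark filter for the ICMP payload check)
    tcp.len > 1200                      (Wireshark filter to spot TCP segments above the 1280 MSS)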
Port Query Results
ICMP packets are fine; they don’t have any issues. Only FTP/HTTP traffic is affected. This means there are no issues up to the network layer, but at the session/application layer the traffic falls apart. At times PortQry also fails, reporting port 21 as FILTERED from the source to the destination FTP box.
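For reference, the PortQry check against the FTP control port looks roughly like this (the hostname is a placeholder); a healthy port comes back LISTENING, whereas in the failure cases ours comes back FILTERED:

    portqry -n redlion-ftp.example.local -p tcp -e 21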
Right now I am doubting the speed/duplex settings on these switches and VM hosts. Our VM hosts have 10G-capable NICs and so do the switches. The speed is hard-coded on the Nexus switches at the VM host interfaces, so technically the switches control the speed; there is nothing for me to change on the VM hosts’ speed/duplex settings – anything I want to modify has to happen on the Nexus side. The end device, the Red Lion FTP box, is only 100 Mb capable. I can’t simply blame a source talking at full 10 Gig speed to an end device that can’t respond at the same rate, because even the normal SYN/ACK communication is affected by the TCP retransmissions; at the same time, I can’t assume this couldn’t be the reason. It still needs analysis to rule things out.
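As a sanity check, the negotiated and configured speed/duplex can be read straight off the Nexus switches with something along these lines (the interface ID is a placeholder for the host-facing port):

    show interface status                          (negotiated speed/duplex on every port)
    show running-config interface ethernet 1/10    (what is hard-coded on a given port)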
I worked with Cisco, and they say these Nexus switches don’t support the buffering needed for a 10 Gig source talking to a 100 Mb destination, so that combination doesn’t work in a Nexus environment. The alternative they propose as a fix is to update the IOS on these Nexus switches, but that is a tentative solution.
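If buffer exhaustion from the 10G-to-100Mb step-down really is the culprit, the drops should be visible on the Red Lion-facing port of the 3048; commands like these (interface ID is a placeholder) show the discard counters to confirm it:

    show interface ethernet 1/20                   (check the "output discard" counter)
    show interface ethernet 1/20 counters errors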
—— Update on 23rd Jan 2016 ——
<<We’ve updated the Nexus IOS version to the latest, yet we see the same issues. Still banging my head to get this fixed.>>
I will keep on updating this thread as more progress is made… Comments are welcome.
Cheers!
Chaladi
Thought I posted a comment, but it seems to have not worked as I don’t see it.
Have you found any resolution to this issue? I ask as I am seeing something rather similar. I have an Exchange server running W2012 that, after a week or so, will no longer allow Outlook clients to connect. Pings, Outlook Web Access, Outlook Anywhere, etc. to the server are just fine. It just stops Outlook, which I believe is using RPC. The other day I had one user who could not ping the server, but everyone else was able to. I set up a packet capture on the server, filtered by his IP address, and noticed these TCP Spurious Retransmissions from his machine. As soon as I disabled the NIC and re-enabled it, his pings and communication started working fine. This disable/enable also resolves our Outlook issues when those happen.
Thanks.
Hello QS, apologies for the delayed response.
We still can’t figure it out. However, my analysis hints that there is some problem with MSS/MTU. Please try limiting the MTU size on any one of the impacted end-user NICs and see if this works.
Try limiting the MTU to 1400 or something like that and test.
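On a Windows box that would be roughly the following (the interface name is whatever your NIC is called; the first command shows the current values):

    netsh interface ipv4 show subinterfaces
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1400 store=persistent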
Please let me know your findings once you have tested it.
Cheers!
Chaladi
For us, when our server stops responding to the RPC requests it affects ALL Outlook clients. After further investigation, the issue with one machine failing to talk to the server at all might have been caused by our antivirus freaking out.
As for the bigger issue with our CAS Exchange server, we still see it once a week and nothing we’ve tried has worked thus far. If we preemptively disable/enable the NIC once a week we can manually avoid the crash. It appears to me there is some type of buffer on the NIC or NIC driver that’s overflowing.
We’ve disabled TCP offloading and a number of other driver features in the vmxnet3 drivers.
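Roughly speaking, on Server 2012 that was along these lines (the adapter name here is just an example):

    Get-NetAdapterAdvancedProperty -Name "Ethernet0"        # review the vmxnet3 driver settings
    Disable-NetAdapterLso -Name "Ethernet0"                 # turn off large send offload
    Disable-NetAdapterChecksumOffload -Name "Ethernet0"     # turn off checksum offload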
Thanks, xyzlor, for sharing your findings. Have you used Wireshark or NetMon-type tools, and did they turn up anything?
Yeah — from what it looks like to me, once the server goes into this state, the Outlook clients initiate communication with the server and it begins to go through the authentication phase. It appears to get to the point of exchanging a certificate, then the server starts the TCP retransmissions, the client just hangs for 3-5 minutes, and Outlook eventually pops up a can’t-connect message. We initially thought it had to be a straight Exchange 2013 issue, but we’ve tried bouncing all the Exchange services when this happens and that doesn’t resolve it. The only things that resolve it are bouncing the server or a disable/enable of the NIC. Neither VMware nor Microsoft have any idea either (we opened support cases with both). One of the strangest things I’ve run across in my 25 years in IT…
You have done a brilliant job… and thanks for sharing the details. Yes, this TCP spurious retransmissions issue has become a big headache for me too. We still can’t solve our issue; instead we deployed a new device at our site 🙂 – the budget allowed us to do so… 🙂