TIME_WAIT and Your Test Suite

I ran into an issue where I couldn’t run more than one system test in a single run without the second test (and every test after it) failing. These particular tests involved starting an HTTP server to act as the external API that the application would interact with during the course of the tests. The problem was that after the first test completed, the HTTP server would fail to start for the next test, getting an “address already in use” error.

Running netstat after my tests ran revealed a connection on the server port, 8085, in the TIME_WAIT state:


> netstat -an | grep 8085
tcp4  0  0  127.0.0.1.8085  127.0.0.1.58244  TIME_WAIT

I’ve done enough network troubleshooting over the years to be somewhat familiar with TIME_WAIT. However, I needed to dig into it again to understand why the server port in my test suite was ending up in a TIME_WAIT state and why that was preventing other tests from running properly. I’ll share my findings in this post.

Address Already in Use

The output of the netstat command above shows that the TCP connection from localhost port 8085 to localhost port 58244 is in a TIME_WAIT state. On my laptop, which is running macOS 10.13, it would stay in this state for 30 seconds before clearing. During those 30 seconds, any attempt to start a server that would listen on port 8085 would fail because the port was considered in use until the TIME_WAIT cleared.

Being in a TIME_WAIT state, as explained in “TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT”, means:

… that from the local end-point point of view, the connection is closed but we’re still waiting before accepting a new connection in order to prevent delayed duplicate packets from the previous connection from being accepted by the new connection.

Active Close Gets the TIME_WAIT

The HTTP server in my test couldn’t start listening on port 8085 while there was a socket in the TIME_WAIT state. So how was it ending up in that state?

From “TIME_WAIT and its design implications for protocols and scalable client server systems”:

… it’s the final state that the peer that initiates the “active close” ends up in and this can be either the client or the server.

This meant that my HTTP server must have been initiating the close, since the server port was the one that was ending up in TIME_WAIT.
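
To see this outside of any test framework, here is a minimal sketch using plain sockets in Python rather than the .NET code from the tests. The function name is made up for illustration, and it uses an OS-assigned port instead of 8085 so it won’t collide with anything else on the machine. The server side closes first, the listener is shut down, and then a fresh socket tries to bind to the same port, the way the next test’s server would. On Linux and macOS I would expect the bind to fail with EADDRINUSE while the TIME_WAIT lasts; other platforms may behave differently.

```python
import errno
import socket


def rebind_after_server_close():
    """Return the errno from re-binding after a server-side active close (0 on success)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port (assumption)
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()

    conn.close()      # server side closes first: this is the active close
    client.recv(1)    # returns b"" once the server's FIN has arrived
    client.close()    # passive close; the server-side socket heads for TIME_WAIT
    listener.close()  # "shut down the server" at the end of the test

    # The next test tries to start its server on the same port.
    retry = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        retry.bind(("127.0.0.1", port))
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        retry.close()


if __name__ == "__main__":
    code = rebind_after_server_close()
    print(errno.errorcode.get(code, "bind succeeded"))
```

Note that the sketch deliberately does not set SO_REUSEADDR on either socket, since the point is to show the default behavior.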

Solution

The solution to my problem was to make sure the client was the one to close the connection first. As long as the client closed the connection before the end of the test (when the server would be shut down), the TIME_WAIT would land on the client’s ephemeral port rather than on the server’s port 8085.


> netstat -an | grep 8085
tcp4  0  0  127.0.0.1.60382   127.0.0.1.8085   TIME_WAIT
tcp4  0  0  127.0.0.1.60381   127.0.0.1.8085   TIME_WAIT
tcp4  0  0  127.0.0.1.60383   127.0.0.1.8085   TIME_WAIT

This meant that the server port was no longer in use. When the next test started, the server could start listening on port 8085 without getting an error.
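
The same idea can be sketched with plain sockets in Python (an illustration only, not the .NET code from the tests). The function name, the OS-assigned port, and the short sleep are assumptions of the sketch. Here the client closes first, so the TIME_WAIT ends up on the client’s ephemeral port, and a new listener should be able to bind to the server port right away.

```python
import socket
import time


def rebind_after_client_close():
    """Return True if the server port can be bound again after a client-side active close."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port (assumption)
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()

    client.close()    # client closes first: this is the active close
    conn.recv(1)      # returns b"" once the client's FIN has arrived
    conn.close()      # passive close; TIME_WAIT lands on the client's port
    listener.close()  # "shut down the server" at the end of the test
    time.sleep(0.1)   # give the final ACK a moment to be processed

    # The next test starts its server on the same port.
    retry = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        retry.bind(("127.0.0.1", port))
        retry.listen(1)
        return True
    except OSError:
        return False
    finally:
        retry.close()


if __name__ == "__main__":
    print("rebind ok" if rebind_after_client_close() else "rebind failed")
```

The only difference from a failing run is the order of the two close calls; in a real test suite, that means closing or disposing of the HTTP client before stopping the stub server.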

Figuring out how to do that turned out to be a little tricky in the .NET environment where I was running my tests (using the System.Net.Http.HttpClient class), and I hope to write a follow-up post about that experience someday soon.