Article summary
I ran into an issue where I was unable to run more than one system test at a time without the second (and every subsequent test) failing. These particular tests involved starting an HTTP server to act as the external API that the application would interact with during the tests. The problem was that after the first test completed, the HTTP server would fail to start on the next test, getting an “address already in use” error.
Running netstat after my tests ran revealed that the server port, 8085, was in a TIME_WAIT state:
> netstat -an | grep 8085
tcp4 0 0 127.0.0.1.8085 127.0.0.1.58244 TIME_WAIT
I’ve done enough network troubleshooting over the years to be somewhat familiar with TIME_WAIT. However, I needed to dig into it again to understand why the server port in my test suite was ending up in a TIME_WAIT state and why that was preventing other tests from running properly. I’ll share my findings in this post.
Address Already in Use
The output of the netstat command above shows that the TCP connection from localhost 8085 to localhost 58244 is in a TIME_WAIT state. On my laptop, which is running macOS 10.13, it would stay in this state for 30 seconds before clearing. During those 30 seconds, any attempt to start a server listening on port 8085 would fail because the port was considered in use until the TIME_WAIT cleared.
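To make the failure concrete, here is a minimal Python sketch (my tests were not written in Python, so treat this purely as an illustration). The helper name can_bind is hypothetical. It tries to bind a fresh socket to a port and reports whether the operating system rejects it with the "address already in use" error.

```python
import errno
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a fresh TCP socket can bind to (host, port) right now."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # SO_REUSEADDR is deliberately not set, so the OS applies its
        # strict default rules about ports that are still considered in use.
        probe.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False  # the "address already in use" case
        raise  # any other error is unexpected, so let it surface
    finally:
        probe.close()
```

Calling can_bind(8085) while the port is held by a lingering TIME_WAIT socket would be expected to return False on a system that behaves like my laptop, though the exact rules vary by operating system.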
Being in a TIME_WAIT state, as explained in TCP: About FIN_WAIT_2, TIME_WAIT and CLOSE_WAIT, means:
… that from the local end-point point of view, the connection is closed but we’re still waiting before accepting a new connection in order to prevent delayed duplicate packets from the previous connection from being accepted by the new connection.
Active Close Gets the TIME_WAIT
The HTTP server in my test couldn’t start listening on port 8085 while there was a socket in the TIME_WAIT state. So how was it ending up in that state?
From TIME_WAIT and its design implications for protocols and scalable client server systems:
… it’s the final state that the peer that initiates the “active close” ends up in and this can be either the client or the server.
This meant that my HTTP server must have been initiating the close, since the server port was the one ending up in TIME_WAIT.
Solution
The solution to my problem was to make sure the client was the one that actively closed the connection. As long as the client closed the connection before the end of the test (when the server would be shut down), the client’s port would get the TIME_WAIT:
> netstat -an | grep 8085
tcp4 0 0 127.0.0.1.60382 127.0.0.1.8085 TIME_WAIT
tcp4 0 0 127.0.0.1.60381 127.0.0.1.8085 TIME_WAIT
tcp4 0 0 127.0.0.1.60383 127.0.0.1.8085 TIME_WAIT
This meant that the server port was no longer in use. When the next test started, the server could start listening on port 8085 without getting an error.
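The effect of close order can be sketched with plain sockets. The following Python example is an illustration under stated assumptions, not the code from my test suite: the function name rebind_succeeds is hypothetical, the port is chosen by the OS rather than fixed at 8085, and the 0.2 second pause is an arbitrary allowance for the close handshake to finish on loopback.

```python
import socket
import time


def rebind_succeeds(server_closes_first: bool, host: str = "127.0.0.1") -> bool:
    """Open one connection, close it in the chosen order, then try to rebind."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0 asks the OS to pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection((host, port))
    conn, _ = listener.accept()

    # Whichever side calls close() first performs the active close
    # and is the one whose socket ends up in TIME_WAIT.
    first, second = (conn, client) if server_closes_first else (client, conn)
    first.close()
    second.recv(1)  # returns b"" once the peer's FIN has arrived
    second.close()
    listener.close()
    time.sleep(0.2)  # assumed to be enough for the handshake on loopback

    retry = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        retry.bind((host, port))  # no SO_REUSEADDR, like a strict server
        return True
    except OSError:
        return False
    finally:
        retry.close()
```

With server_closes_first=False, the TIME_WAIT socket belongs to the client's ephemeral port, so the rebind should succeed. With server_closes_first=True, I would expect the rebind to fail on systems that behave like my laptop until the TIME_WAIT expires, but that result depends on the operating system.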
Figuring out how to do that turned out to be a little tricky in the .NET environment where I was running my tests (using the System.Net.Http.HttpClient class), and I hope to write a follow-up post about that experience someday soon.