If you are using WebSockets, you are bound to run into this issue. Unfortunately, none of the popular WebSocket libraries solves it.

Let's say your WebSocket-based application is running on 2 servers, A & B. The servers are connected through RabbitMQ, and users reach them through a Load Balancer (LB) endpoint.

Connections come in and the LB distributes them evenly across the 2 nodes.

After a while, you notice the connection load is too high for just 2 servers, so you add a third server, C.

However, C doesn't get any of the existing traffic! WebSocket connections are persistent so they stick to whatever node they are connected to.

So you restart A & B, hoping clients will reconnect and the load will end up evenly distributed. Assuming both A & B restart at the exact same time, this is what happens next:

C has all the connections now and A & B have none. This is even worse than what you started with!

What happened is that while A & B were restarting, all existing connections reconnected to C. If you restart C now, you will be back to your original situation of all traffic being at A & B and none at C.
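To make the sequence above concrete, here is a minimal simulation sketch. It assumes a least-connections LB and 1000 clients that all reconnect immediately while A & B are still down; the `route` and `simulate_simultaneous_restart` helpers are hypothetical names made up for this illustration, not part of any WebSocket library.

```python
from collections import Counter


def route(counts, available, n_clients):
    """Send n_clients new connections, one by one, to the least-loaded available node."""
    for _ in range(n_clients):
        # Assumption: the LB picks the node with the fewest connections (ties broken by name).
        target = min(sorted(available), key=lambda node: counts[node])
        counts[target] += 1


def simulate_simultaneous_restart(n_clients=1000):
    counts = Counter()
    # Initial state: only A and B exist, so the LB splits the clients evenly.
    route(counts, {"A", "B"}, n_clients)
    before = dict(counts)

    # Add C. Existing connections are persistent, so nothing moves to it.
    counts["C"] = 0

    # Restart A and B at the same time: every client is dropped and
    # reconnects while C is the only node available.
    dropped = counts["A"] + counts["B"]
    counts["A"] = counts["B"] = 0
    route(counts, {"C"}, dropped)
    return before, dict(counts)


before, after = simulate_simultaneous_restart()
print(before)  # {'A': 500, 'B': 500}
print(after)   # {'A': 0, 'B': 0, 'C': 1000}
```

Restarting C in this model would simply send its 1000 connections back to A & B, which is the original situation again.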

This persistent connection behavior can surprise you in many ways. Simple operations, like doing a rolling upgrade of your nodes, can result in unexpected behavior like all traffic being sent to just one node or the last node upgraded not getting any traffic at all. This can result in huge scalability issues for your application.
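The rolling upgrade case can be sketched the same way. This is a toy model under stated assumptions: a least-connections LB, three nodes with 300 connections each, and clients that reconnect to the remaining nodes before the upgraded node comes back. The `rolling_upgrade` helper is hypothetical.

```python
def route(counts, available, n_clients):
    """Send n_clients new connections, one by one, to the least-loaded available node."""
    for _ in range(n_clients):
        target = min(sorted(available), key=lambda node: counts[node])
        counts[target] += 1


def rolling_upgrade(counts, order):
    """Restart nodes one at a time; dropped clients reconnect to the other nodes."""
    for node in order:
        dropped = counts[node]
        counts[node] = 0  # the node goes down and loses its connections
        others = [n for n in counts if n != node]
        route(counts, others, dropped)  # clients reconnect before the node is back
    return counts


result = rolling_upgrade({"A": 300, "B": 300, "C": 300}, ["A", "B", "C"])
print(result)  # {'A': 450, 'B': 450, 'C': 0}
```

In this model the last node upgraded ends up with no connections at all, even though the cluster started out perfectly balanced.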

Any Solution?

Surprisingly, there isn't any! As Sam points out, this is a tough problem that requires inventing a whole protocol in WebSocket frameworks, both client and server side, to do correctly. And unfortunately, none of them currently does so.
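To give a feel for what such a protocol would have to compute, here is a rough sketch of one possible building block. It is purely an assumption on my part and not something any existing framework provides: each overloaded server works out how many connections exceed its fair share, asks that many clients to reconnect, and each client waits a random delay first so they don't all reconnect at once. All function names are hypothetical.

```python
import random


def connections_to_shed(own, total, nodes):
    """Number of connections a node should hand off to get down to its fair share."""
    fair_share = -(-total // nodes)  # ceiling division
    return max(0, own - fair_share)


def pick_clients_to_move(client_ids, total, nodes, rng=random):
    """Randomly choose which connected clients are asked to reconnect (server side)."""
    n = connections_to_shed(len(client_ids), total, nodes)
    return rng.sample(client_ids, n)


def reconnect_delay(max_jitter_seconds, rng=random):
    """Random wait before reconnecting, to spread out the reconnects (client side)."""
    return rng.uniform(0, max_jitter_seconds)


# A holds 500 of the cluster's 1000 connections after C joins as a third node.
print(connections_to_shed(500, 1000, 3))  # 166
# C holds nothing, so it sheds nothing.
print(connections_to_shed(0, 1000, 3))    # 0
```

Even this small piece hints at why the full problem is hard: every server needs an up-to-date view of the cluster-wide connection count, the client has to understand the "please reconnect" message, and the LB has to send the reconnecting clients to the emptier node.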

The only kinda, sorta solution is AWS API Gateway (which was also pointed out by Sam).

With AWS API Gateway, your WebSockets are connected to the Gateway instead of your servers. The Gateway then talks to your applications via HTTP. But since the Gateway is using vanilla WebSockets, even it can run into the same issue I described above, where underneath, all your connections end up being connected to a single server inside the Gateway.

It seems folks on Twitter caution about the same:


If you are looking to use WebSockets, it might be worth pondering this issue. At the very least, due to the additional complexity introduced, WebSockets should only be used if plain old HTTP really is too slow for you.

I do hope WebSocket frameworks like Spring WebSockets put thought into this issue and build a protocol to solve it.