Negative impact of AVX Workloads on Cloud VMs and Kubernetes clusters

Intel processors offer the AVX-512 instruction set to speed up vectorized workloads, and it is tempting to use it in applications and databases deployed in the cloud.

However, there is a flip side to it.

Using AVX-512 instructions can cause the entire processor to clock down! This has huge implications.
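Before worrying about any of this, it helps to know whether the machine you are on exposes AVX-512 at all. Here is a minimal sketch, assuming a Linux guest where `/proc/cpuinfo` lists CPU feature flags such as `avx512f`. The helper name `avx512_flags` is made up for illustration, and a hypervisor may mask flags, so treat the result as a hint rather than proof.

```python
from pathlib import Path


def avx512_flags(cpuinfo_text):
    """Return the set of AVX-512 feature flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Each logical CPU has its own "flags" line; collect across all of them.
        if key.strip() == "flags":
            found.update(flag for flag in value.split() if flag.startswith("avx512"))
    return found


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only (assumption)
    if path.exists():
        flags = avx512_flags(path.read_text())
        print(sorted(flags) if flags else "no AVX-512 flags reported")
    else:
        print("/proc/cpuinfo is not available on this platform")
```

If the set is empty, your own code cannot trigger the slowdown on this machine, though that says nothing about what a neighbour on the same physical processor is doing.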

Effect on Cloud VMs

The AVX slowdown doesn't care about VM boundaries. When you rent a VM on AWS, GCP, etc., you get access to just a few of the many cores of a physical processor.

Let's say a processor on AWS has 4 cores, and you request 2 of them for your VM. Another account, B, spins up a VM on AWS and is assigned the remaining 2 cores of that same processor. Now B starts running an AVX-heavy workload. Well, what do you know, your VM gets slowed down too!
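You cannot see what a neighbour is running, but you can watch for the symptom. The sketch below is one rough way to do that: time the same fixed, scalar workload over and over and compare against a baseline taken when things were quiet. All the names here (`busy_work`, `sample_timings`, `slowdown_ratio`) are hypothetical, and a slow run does not prove AVX is the cause; CPU steal, noisy neighbours of other kinds, or your own background load would look the same.

```python
import statistics
import time


def busy_work(iterations):
    """Fixed integer-only loop, so its run time should roughly track core clock speed."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def sample_timings(samples=5, iterations=200_000):
    """Run the same workload several times and return the seconds taken by each run."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        busy_work(iterations)
        timings.append(time.perf_counter() - start)
    return timings


def slowdown_ratio(timings, baseline_seconds):
    """Median run time relative to a baseline measured earlier on the same VM."""
    return statistics.median(timings) / baseline_seconds


if __name__ == "__main__":
    baseline = statistics.median(sample_timings())
    later = sample_timings()
    print(f"slowdown ratio vs. baseline: {slowdown_ratio(later, baseline):.2f}")
```

A ratio that stays well above 1.0 across many samples suggests the cores are running slower than before. Using the median keeps a single hiccup from skewing the result.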

AVX512 is architecturally transparent to VTx (or the other way around, depending on how you view these things).

Turbo penalty for AVX512 is package wide and gives zero fucks about your VM boundaries :)

(Put differently: what Kelly said yesterday)
— Jon Olson (@jonolson) August 14, 2018

Effect on Kubernetes clusters

It means your own Docker containers running AVX workloads can slow down your other containers, despite the resource limits you set. Not only that: a different account's Kubernetes cluster, with pods scheduled on a different VM but on the same physical processor as yours, can impact your containers too!

This was pointed out by Kelly Sommers yesterday.

So here’s a real question. What does Amazon and Microsoft and other kubernetes cloud services do to prevent your containers from losing 11ghz of performance because someone deployed some AVX optimized algorithm on the same host?
— Kelly Sommers (@kellabyte) August 13, 2018

No easy way out

I went ahead and filed Kubernetes bug #67355 for this. The maintainers do seem to be aware of the issue but currently have no good answers for it.

Even Intel currently has no solution for it.

Conclusion

This is a Catch-22 situation all around. Cloud vendors want to offer VMs with AVX-512 instructions enabled so that their users can get better performance. It is in the best interest of the individual user to use them. However, doing so may impact not only their own VMs/containers but also another account's VMs.