Horrors of using Azure Kubernetes Service in production

Azure Kubernetes Service (AKS) was recently marked as GA. We decided to move our production workload to it last month. What follows is an account of what it's really like to use it in production.

1. Random DNS failures

We started seeing random DNS failures right away, both for domains outside Azure (e.g. sqs.us-east-1.amazonaws.com) and even for hostnames inside the Azure Virtual Network. While they would resolve eventually after multiple retries, it was surprising that a fundamental feature like DNS would be broken.

Azure Support told us that resolution of DNS names that point outside of Azure was not their problem (which is surprising, since that is what DNS is for). They would only work on DNS failures for hostnames inside Azure.

Azure resolved the issue by blaming CPU/memory usage. We were told not to use too much CPU/memory if we wanted DNS to work reliably!

Apart from the ridiculousness of this resolution, they ignored our response when we told them the issue mostly surfaces during application startup, when CPU/memory usage is minimal.
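For readers hitting the same symptom, here is a minimal sketch of the kind of retry wrapper that papers over intermittent lookups. It is an illustration only, not the code we ran: the function name `resolve_with_retry`, the attempt count and the delays are all assumptions, and it works around the failure rather than fixing it.

```python
import socket
import time


def resolve_with_retry(host, port=443, attempts=5, base_delay=0.2,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve `host`, retrying on DNS errors with exponential backoff.

    `resolver` and `sleep` are injectable so the behaviour can be tested
    without real DNS. All default values are illustrative assumptions.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as err:
            # Name resolution failed; remember the error and back off.
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: surface the last DNS error to the caller.
    raise last_error
```

Calling `resolve_with_retry("sqs.us-east-1.amazonaws.com")` would then return the usual `getaddrinfo` result once any attempt succeeds. It also adds startup latency, which is exactly when we saw the failures.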

2. Required daily reboot of Kubernetes API Server

After a few days we noticed that we could no longer launch the Kubernetes Dashboard. After a harrowing time dealing with multiple Azure Support personnel, the issue was acknowledged as valid, and the only resolution was to reboot the Kubernetes API Server. Since the API Server is managed by Azure, this meant opening a support ticket, escalating it to engineering, and then asking them to reboot it.

This problem would resurface daily, so we had to open a ticket every single day and escalate it to get the API Server rebooted.

I had to document this procedure in an email to Azure Support so they could escalate the daily tickets without asking the same questions over and over again.

3. Container crash would bring down entire node

If a Docker container crashed, it would bring down the entire underlying VM. The only way to recover was to log in to the Azure portal and manually reboot the VM.

The resolution by Azure Support: "Yeah this is your problem. Just make sure your containers never crash".

This is what its like to use @Azure Kubernetes in production. Every day since the last week pic.twitter.com/BLxn5EwkW4
— Prashant Deva (@pdeva) July 13, 2018

4. Unrecoverable cluster crash

One day I woke up to find every single node in the Kubernetes cluster was down! Rebooting the nodes from the Azure portal did nothing.

Azure Support tried to bring the cluster back up, but the nodes would keep going down regardless. Eventually, after more than 8 hours, they finally brought the cluster back up, but we could no longer run any containers on it! From that point on, our containers wouldn't even start, and the error message would point to some Golang code (our application is in Java). Yet Azure Support dismissed it as an 'issue on your end' and closed the ticket.

We now sat with a cluster we couldn't deploy containers on, but which Azure considered fine.

5. SLA violation ignored

Even though there is no SLA for AKS, the individual VM nodes do have to abide by Azure's 99.9% SLA. Since the VMs were down for many, many hours, we opened a ticket to file an SLA claim, and Azure simply ignored it! It remains open and ignored weeks later.
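To put the 99.9% figure in perspective, here is a rough back-of-the-envelope calculation. It assumes a 30-day month and uses the 8-hour cluster outage from the previous section as a lower bound. Azure's actual SLA terms and measurement windows are not reproduced here, and the function names are just for illustration.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Downtime budget in minutes for a given SLA over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def availability_percent(downtime_hours, days=30):
    """Availability actually achieved, given hours of downtime."""
    total_hours = days * 24
    return 100 * (total_hours - downtime_hours) / total_hours


# Assumed inputs: 30-day month, 8 hours of downtime.
budget = allowed_downtime_minutes(99.9)
achieved = availability_percent(8)
print(f"99.9% SLA allows about {budget:.1f} minutes of downtime per month")
print(f"8 hours of downtime is about {achieved:.2f}% availability")
```

Under those assumptions the budget is roughly 43 minutes a month, while 8 hours of downtime works out to under 99% availability.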

Conclusion

Azure Kubernetes Service (AKS) is an alpha product marked as GA by Microsoft.
Azure Support has been the worst support experience of my life. Not only were our P1 tickets (with <1hr response time) answered after >24 hours, the resolutions of the tickets were laughable. Ignoring the SLA violation is downright fraudulent behavior.

We have finally moved to Google Cloud, which has the best Kubernetes implementation out there.