The Challenges in Migrating a 500,000 WAU SaaS Company to Kubernetes
Everything has grown dramatically from handling 20K req/min in our core application to 130K req/min.
Before we dive into the story, a few words for clarification: at monday.com, we have experienced constant hyper-growth in the last few years, with our customer base growing 3x each year, and our engineering teams doubling in size every year.
With great scale comes great responsibility to scale, and doing so in a way that will allow us to ensure our availability and resiliency, while still allowing our engineering teams to stay agile and develop new features.
During our first years as a small startup, when delivery was the most important thing to focus on, we ran all our workloads on a PaaS called Engine Yard, a cloud service running on top of AWS for Ruby on Rails applications. As we grew, we realized that we needed more freedom in our infrastructure design, so we migrated our workloads directly to AWS, using self-managed servers and some Infrastructure-as-Code tooling. But that is a story for a different blog post.
A year has gone by and again, everything has grown dramatically. We grew from 15 servers to over 100 servers, from using one central database to many and from handling 20K req/min in our core application to 130K req/min.
At that point, we found ourselves up against a wall. There were just so many moving parts! We were using Ansible, Terraform, Packer, and our own set of custom scripts. The deployment process was Blue/Green, so we had to provision a whole set of new instances every time a new release was available (which was more than 20 times a day).
As a result, we started to experience a lot of failures during provisioning and deployment, and it was hard to troubleshoot every time.
We came to realize that it was time to migrate our workloads again, and we were looking for an option that is mature, stable, scalable, dynamic, and maintained by a huge community. After some research, we decided to go with Kubernetes, or to be precise, AWS’s managed offering, EKS.
For us to understand the complexity of the project we first need to understand its challenges:
- Our Ruby on Rails primary application uses Sidekiq to manage its asynchronous jobs. Sidekiq has worker processes that consume one or more queues, and a collection of worker processes runs on each EC2 instance. Do we keep this model, where worker processes consume a lot of different queues, or do we split queues into different processes and achieve more fine-grained scalability and resource utilization?
- We are using Nginx in front of our Unicorn web servers when running on EC2. Each application server has exactly the same number of Nginx worker processes and Unicorn workers. What will be the best solution when migrating to Kubernetes?
- We are using a WAF that’s installed as an Nginx module. How can we use it when running on Kubernetes?
- We are deploying our application with a Blue/Green deployment strategy, and we must preserve it.
- We had to give our developers easy access to staging pods for debugging, and we wanted to make it completely transparent, without making a lot of changes to their day-to-day routines.
- Last but not least, we need to plan how we are going to actually perform the traffic steering from our old EC2 instances to K8s.
To better understand these challenges, let’s dive into a few key ones.
Async jobs – Sidekiq
When we started thinking about Sidekiq queues, our existing design was very complex: many EC2 instances, each running many processes that consumed many queues. It was impossible to manage, did not scale, and had many resiliency issues.
As stated before, our Sidekiq worker processes were running all over the place when deployed on EC2. We had a fixed set of instances, each consuming and processing from all queues, giving us a known maximum capacity per queue. This means that in order to cope with bursts of messages, we had to massively overprovision our infrastructure.
Our first, relatively simple decision was to template the concept of a “worker process” and use a Helm chart to instantiate capacity per queue.
By templating how one Sidekiq worker is deployed, and assuming that every worker consumes a single queue, we could deploy as many workers as we need and have each queue’s consumer group scale independently of the others.
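To make the "one template, many queues" idea concrete, here is a minimal sketch in Python rather than an actual Helm chart. The queue names, image tag, and the `sidekiq_worker_deployment` helper are all hypothetical; the only Sidekiq detail relied on is the `-q` flag, which restricts a process to the given queue.

```python
import json


def sidekiq_worker_deployment(queue: str, image: str, replicas: int = 1) -> dict:
    """Render one Deployment whose pods consume a single Sidekiq queue."""
    # Kubernetes object names may not contain underscores, so normalize them.
    name = f"sidekiq-{queue.replace('_', '-')}"
    labels = {"app": "sidekiq", "queue": queue}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            # One queue per process, so each queue scales alone.
                            "command": ["bundle", "exec", "sidekiq", "-q", queue],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical queue list; in a Helm chart this would live in values.yaml.
QUEUES = ["default", "mailers", "imports"]
manifests = [sidekiq_worker_deployment(q, "example/app:1.0.0") for q in QUEUES]

if __name__ == "__main__":
    print(json.dumps(manifests, indent=2))
```

Adding capacity for a new queue then means adding one entry to the list, not designing a new server layout.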
Speaking of scaling, we knew right from the beginning that using CPU or memory for our HPA (Horizontal Pod Autoscaler, the Kubernetes object that defines auto-scaling per workload) wouldn’t do the trick for us. We needed to scale on a queue-level metric that reflects the number of events pending to be processed for a certain queue.
As it happened, we already had Sidekiq queue metrics reported to AWS CloudWatch, such as the number of processed messages and queue latency, which we use for application alerts to our on-call developer. So why not also use them as custom metrics for our HPA?
AWS officially supports that capability by using the k8s-cloudwatch-adapter, so we could use the same Helm template feature for our HPA.
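As a rough illustration of what a per-queue HPA on an external metric looks like, the sketch below builds the manifest as a Python dictionary and mirrors the basic replica calculation Kubernetes documents for an `AverageValue` target. The metric name, target value, and replica bounds are assumptions, the `autoscaling/v2` API version is simply the current stable one rather than what we necessarily ran, and the calculation ignores the HPA's tolerance and stabilization behavior.

```python
import math


def queue_hpa(queue: str, metric_name: str, target_per_pod: int,
              min_replicas: int = 1, max_replicas: int = 20) -> dict:
    """Build an HPA that scales a queue's Deployment on an external metric."""
    name = f"sidekiq-{queue}"
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "External",
                    "external": {
                        # Hypothetical name exposed by the metrics adapter.
                        "metric": {"name": metric_name},
                        "target": {
                            "type": "AverageValue",
                            "averageValue": str(target_per_pod),
                        },
                    },
                }
            ],
        },
    }


def desired_replicas(metric_total: float, target_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified HPA math: ceil(total / per-pod target), clamped to bounds."""
    raw = math.ceil(metric_total / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


hpa = queue_hpa("default", "sidekiq-default-queue-size", target_per_pod=100)
```

With a per-pod target of 100 pending jobs, a backlog of 450 jobs asks for five workers, while an empty queue falls back to the configured minimum.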
With these decisions, our new state is much more stable, scalable, and resilient.
Blue / Green Deployment
K8s natively supports 2 deployment methods:
- RollingUpdate, which provisions a subset of new pods and then gradually replaces old with new ones
- Recreate, which waits first for all old pods to terminate and then schedules new pods
“Recreate” is obviously unsuitable for a web application that needs to stay available during deployments. “RollingUpdate” is a much better solution, however, it bears the underlying assumption that all versions are necessarily both backward and forward compatible. For our primary application, sadly, this was not the case.
This meant that we had to stick with Blue / Green deployments, so again we started by exploring our options.
One option would be embarking upon a journey and implementing it ourselves by using labels and services. We also thought about implementing it using Pulumi, an Infrastructure as Code toolkit that has a K8s library we can leverage. Another direction that came to mind was searching for CD tools that support Blue/Green such as Harness and Spinnaker.
All of those solutions, while feasible, were too complicated. If we implemented it using scripts or Pulumi, we would have to add another new technology to our stack and, obviously, maintain it. If we chose a tool that already supports Blue/Green deployments, we would need to invest a significant amount of time in learning and implementing a new CD tool, which would be a huge overhead for our project.
We then stumbled upon a relatively new tool, argo-rollouts. It’s a K8s controller and a set of CRDs (Custom Resource Definitions) that provide deployment capabilities such as blue/green and canary. We liked its approach of adding the missing K8s capabilities using CRDs.
Activating the Blue/Green deployment strategy was actually pretty simple and required installing a few CRDs and slightly changing our main deployment.yaml definition. We could still use the helm upgrade command the same as before, which was great since we didn’t have to add a new tool to our stack and we didn’t have to maintain complex scripts.
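To give a feel for how small that change is, here is a hedged sketch that converts a Deployment-shaped manifest into an Argo Rollouts `Rollout` with a blue/green strategy. The service names and the example Deployment are hypothetical, and a real chart would make this change directly in the YAML template; the `activeService`, `previewService`, and `autoPromotionEnabled` fields are part of the Rollout spec.

```python
import copy


def to_blue_green_rollout(deployment: dict, active_service: str,
                          preview_service: str) -> dict:
    """Return a Rollout manifest derived from a Deployment manifest."""
    rollout = copy.deepcopy(deployment)  # leave the input untouched
    rollout["apiVersion"] = "argoproj.io/v1alpha1"
    rollout["kind"] = "Rollout"
    # Replace any native Deployment strategy with the blue/green one.
    rollout["spec"]["strategy"] = {
        "blueGreen": {
            # Service that receives live traffic.
            "activeService": active_service,
            # Service that points at the new version before promotion.
            "previewService": preview_service,
            "autoPromotionEnabled": True,
        }
    }
    return rollout


# Hypothetical, heavily trimmed Deployment for illustration only.
web_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        "strategy": {"type": "RollingUpdate"},
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {"containers": [{"name": "web", "image": "example/app:1.0.0"}]},
        },
    },
}

web_rollout = to_blue_green_rollout(web_deployment, "web-active", "web-preview")
```

The pod template is carried over as-is; only the kind, API version, and strategy differ, which is why the existing Helm workflow could stay in place.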
Developer Access
Application pods running in Kubernetes are not intended to be connected to. Still, we needed to allow our developers the ability to run scripts and basic commands against the production application for maintenance, support, and data migration purposes.
For that, we came up with what we call a “developer mode”.
A monday.com developer has access to an in-house tool we created called the “monday CLI”, allowing them to run a multitude of commands useful for their day-to-day.
Into that existing infrastructure, we integrated the capabilities of kubectl (the Kubernetes CLI) with custom-built configurations. That way, by executing a simple command on a developer’s workstation, alongside their personal credentials, it is possible for that developer to easily provision a dedicated pod with their name, running one of our production builds in a “detached” mode (so the pod does not receive any actual production traffic). We even took it a bit further and opened a shell to the pod automatically.
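A minimal sketch of the kind of pod definition such a command could generate is shown below. Everything here is an assumption for illustration: the `developer_pod` helper, the label and annotation names, and the `sleep infinity` command that keeps the container alive for an interactive shell. The point is that the pod carries the developer's name and deliberately omits the label a Service selector would match, so it receives no production traffic.

```python
import re


def developer_pod(developer: str, image: str, ttl_hours: int) -> dict:
    """Build a personal, traffic-free pod manifest for one developer."""
    # Pod names must be lowercase alphanumerics and dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", developer.lower()).strip("-")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"dev-{slug}",
            # No "app" label, so no Service selector picks this pod up.
            "labels": {"role": "developer-mode", "owner": slug},
            # Hypothetical annotation read later by a cleanup job.
            "annotations": {"example.com/ttl-hours": str(ttl_hours)},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    # Keep the container idle until a shell is attached.
                    "command": ["sleep", "infinity"],
                    "stdin": True,
                    "tty": True,
                }
            ],
        },
    }


pod = developer_pod("Jane.Doe", "example/app:1.0.0", ttl_hours=4)
```

A CLI wrapper could then apply this manifest and run `kubectl exec -it` against the resulting pod to open the shell.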
When a developer who already has a pod runs the CLI command again, they’re given the choice to reconnect to the pod they launched earlier or to reprovision it with the latest application version. Since developer pods are short-lived, the developer can also choose the pod’s TTL. After everything is set, the CLI notifies the developer of the pod’s name, its remaining TTL, and the application version.
TTL is managed by a recurring Kubernetes CronJob that fetches the list of running pods from the Kubernetes API and terminates any pod that has exceeded its allocated TTL.
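The core of such a cleanup job is a small filter over pod metadata. The sketch below shows only that decision logic, under the assumption that each developer pod stores its TTL in a hypothetical `example.com/ttl-hours` annotation; fetching the pod list from the Kubernetes API and deleting the expired pods are left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical annotation holding the TTL chosen by the developer.
TTL_ANNOTATION = "example.com/ttl-hours"


def expired_pods(pods: list, now: datetime) -> list:
    """Return the names of pods that have outlived their TTL annotation."""
    expired = []
    for pod in pods:
        meta = pod["metadata"]
        ttl = meta.get("annotations", {}).get(TTL_ANNOTATION)
        if ttl is None:
            continue  # not a TTL-managed pod, never touch it
        # Kubernetes timestamps look like 2024-01-01T08:00:00Z.
        created = datetime.fromisoformat(
            meta["creationTimestamp"].replace("Z", "+00:00")
        )
        if now - created > timedelta(hours=float(ttl)):
            expired.append(meta["name"])
    return expired


sample_pods = [
    {"metadata": {"name": "dev-old", "creationTimestamp": "2024-01-01T08:00:00Z",
                  "annotations": {TTL_ANNOTATION: "2"}}},
    {"metadata": {"name": "dev-new", "creationTimestamp": "2024-01-01T11:00:00Z",
                  "annotations": {TTL_ANNOTATION: "2"}}},
    {"metadata": {"name": "web-abc", "creationTimestamp": "2023-12-01T00:00:00Z"}},
]
check_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
to_delete = expired_pods(sample_pods, check_time)
```

Skipping pods without the annotation keeps the job from ever terminating regular application pods, no matter how old they are.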
Having solved those key challenges, along with many others such as Docker image complexity and size, networking using nginx-ingress-controller, and more, we can proudly say that monday.com is fully running on Kubernetes. We now have more control over our infrastructure and can scale to give our users the best experience possible.