Building a resilient and scalable infrastructure with Kubernetes Multi-Cluster

Eilon Moalem
Over the years, monday.com has grown significantly, reaching 700k requests per minute (RPM) and approximately 1,500 nodes/25,000 pods at peak times. This resulted in a “monolithic cluster,” where every major infrastructural change had far-reaching effects on the entire customer base.

Three milestones accelerated our need to implement a multi-cluster contingency plan:

  1. After watching Spotify's keynote, “How Spotify Accidentally Deleted All its Kube Clusters with No User Impact,” we were well aware that we had only one cluster, but we still had no motivation to start a multi-cluster project. We then accidentally deleted our production cluster – with user impact – causing one hour of downtime.
  2. Our AWS-managed Kubernetes control plane was already running on the largest available instance types, and we were hitting rate limits with no room left to grow.
  3. With our emphasis on zero-downtime major infrastructural changes, we needed a better approach than in-place upgrades, no matter how well planned they were (have you ever tried changing your CNI in place?).

As a solution, we decided to break the monolithic cluster into k smaller but otherwise identical clusters – each hosting the exact same microservices and still utilizing shared resources – with high confidence that we will not face any scale limits in the future.

In this blog, we will cover how we created identical clusters using GitOps (ArgoCD), how we manage Kubernetes resources and their conventions (Helm), traffic splitting (Cloudflare), and monitoring. In addition, we will cover the surprisingly biggest win of this project – the ability to perform infrastructure A/B testing, where each cluster runs a slightly different configuration, or to canary an infrastructure change. For example, you could test different instance types, different network configurations, or different Kubernetes versions.


We broke the task into a few sub-tasks to address the biggest challenges: cluster configuration and maintenance, traffic management, autoscaling, and monitoring.

Challenge #1 – Cluster configuration and maintenance

Our first challenge was how to create clusters that are almost identical, while making it easy to ensure that our clusters are always in the desired state and that deployments are consistent across different environments. We therefore moved from a “helm install” style of CD to ArgoCD, a GitOps-based continuous delivery tool for Kubernetes, where each deployment (either an infrastructure system or a microservice) is represented as an ApplicationSet with a common structure – the same Helm chart, but different value files with a few levels of variance (sketched right after the list):

  1. Environment – staging/production (short names stage, prod)
  2. Region – us-east-1, eu-central-2, etc. (short names use1, euc2)
  3. Cluster name – each cluster gets a unique name; we chose Pokémon names, since we needed a pool of intuitive names
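
For illustration, this is roughly how that layering resolves for a single service on one production cluster (the file names here are illustrative; the real mechanism appears in the ApplicationSet template below):

helm:
  valueFiles:
    - values.yaml                      # chart defaults, shared by all clusters
    - prod.yaml                        # environment-level overrides (short env name)
    - prod-use1-monday-snorlax.yaml    # cluster-level overrides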

Naming conventions are a key part of our infrastructure because they help organize and standardize the naming of resources, making it easier to understand, maintain, and monitor the entire infrastructure: deployments, clusters, databases, queues, or any other resource.
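
For example, cluster names follow the pattern <env>-<region>-<group>-<name>:

prod-use1-monday-staryu      # production, us-east-1, monday group, cluster "staryu"
stage-euc1-mgmt-dragonite    # staging, eu-central-1, mgmt group, cluster "dragonite"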

One of the key goals of the project was to automate tasks such as provisioning, adding new clusters, and monitoring them, without extra steps. For example, when a naming convention identifies resources, automation tools such as alerts can discover those resources automatically.

To tackle the first challenge, the deployment process needed to be entirely reconstructed as a preliminary stage.

The new deployment process – GitOps

CI/CD process – one of the steps in our CI/CD is committing a new image tag to the GitOps files. The script does not receive as input the clusters to which the application should be deployed; instead, it auto-discovers them by reading the cluster configuration files (see below) and checking whether the specific application appears in each file. This gives us the ability to add additional clusters without any manual changes to the process.

GitOps repository – as previously mentioned, ArgoCD enables us to manage deployments across multiple clusters. In this repository, each microservice is represented by an ApplicationSet.

ArgoCD’s “app of apps” is a pattern that allows us to manage multiple ApplicationSets and their dependencies using a single Git repository.

We split applications into two families:

  1. Infra applications: cluster autoscaler, Prometheus, CoreDNS, etc.
  2. Business applications (microservices): Login, Authentication, Authorization, Notifications, etc.

Example of the business applications structure:

├── templates
│   ├── login_applicationset.yml
│   ├── authorization_applicationset.yml
│   ├── authentication_applicationset.yml
│   ├── notifications_applicationset.yml
├── stage-apse2-mgmt-gengar.yaml
├── stage-apse2-monday-mew.yaml
├── stage-euc1-mgmt-dragonite.yaml
├── stage-euc1-monday-charmander.yaml
├── stage-use1-bigbrain-hypno.yaml
├── stage-use1-mgmt-psyduck.yaml
├── stage-use1-mgmtbb-abra.yaml
├── stage-use1-monday-bulbasaur.yaml
├── stage-use1-monday-lucario.yaml

App of apps:

project: default
source:
  repoURL: 'REPO_URL'
  path: apps/business/shared
  targetRevision: HEAD
  helm:
    valueFiles:
      - prod-use1-monday-snorlax.yaml
destination:
  server: 'https://kubernetes.default.svc'
  namespace: argocd
syncPolicy:
  automated: {}
revisionHistoryLimit: 10

ApplicationSet:

{{ if .Values.services.authentication }}
{{ $service :=  .Values.services.authentication }}
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: {{ .Values.applicationSetPrefix }}-authentication
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: {{ .Values.cluster }}
            url: {{ .Values.url }}
  template:
    metadata:
      name: '{{`{{cluster}}`}}-authentication'
      annotations:
        notifications.argoproj.io/subscribe.on-sync-succeeded.slack: argocd-{{ .Values.env }}
        notifications.argoproj.io/subscribe.on-sync-failed.slack: argocd-{{ .Values.env }}
        notifications.argoproj.io/subscribe.on-deployed.slack: argocd-{{ .Values.env }}
        notifications.argoproj.io/subscribe.on-health-degraded.slack: argocd-{{ .Values.env }}
      labels:
        monday.type: micro-service
        monday.service: authentication
        monday.cluster: '{{`{{cluster}}`}}'
        monday.deployment: main
        {{- if or ((.Values.global).disableArgoWait) (($service.override).disableArgoWait) }}
        argo.wait.enabled: 'false'
        {{- else }}
        argo.wait.enabled: 'true'
        {{- end }}
    spec:
      project: default
      source:
        path: ''
        repoURL: {{ .Values.chartmuseumUrl }}
        targetRevision: {{ $service.targetRevision | quote }}
        chart: authentication
        helm:
          valueFiles:
            - values.yaml
            - {{ .Values.shortEnv }}.yaml
            - {{ .Values.cluster }}.yaml
          parameters:
            - name: deployment.image.tag
              value: {{ $service.imageTag | quote }}
              {{- if .Values.global.disableCrons }}
            - name: global.traffic_precentage
              {{- if ($service.override).traffic_precentage }}
              value: {{ $service.override.traffic_precentage | quote }}
              {{- else }}
              value: {{ .Values.global.traffic_precentage | quote }}
              {{- end }}
              {{- end }}
            {{- range $k, $v := $service.extraParameters }}
            - name: {{ $k }}
              value: {{ $v | quote }}
            {{- end }}  
      destination:
        server: '{{`{{url}}`}}'
        namespace: authentication
      {{- if ($service.override).disableAutoSync }}
      syncPolicy: {}
      {{- else }}
      syncPolicy:
        automated:
          {{- if eq .Values.env "staging" }}
          prune: true
          {{- else }}
          prune: false
          {{- end}}
          selfHeal: false
        syncOptions:
          - CreateNamespace=true
      {{- end }}
{{ end }}


Each cluster configuration file represents one “app of apps” – its metadata and its deployments:

env: production
url: https://XXXXXXXXX.gr7.us-east-1.eks.amazonaws.com
chartmuseumUrl: https://remote.chartmueseum.XX
cluster: prod-use1-monday-staryu

services:
  login:
    targetRevision: 0.0.23
    imageTag: a3455c9eab980e71ab0153ff749dfe266bb400a8
  authorization:
    targetRevision: 0.3.33
    imageTag: 2cdfc97c273ec3132d2dcc41d9ca71cb27f5b0b6
  authentication:
    targetRevision: 0.3.56
    imageTag: 562c887f2d360e04d17aaaabcbf9c940576fd45c
  notifications:
    targetRevision: 0.3.41
    imageTag: f7470c0ba1fed674e82d3180d73641734fa6942f

Altering the deployment process enabled us to overcome the first challenge. The following two examples demonstrate how we handle cluster configuration.

New cluster – two configuration files are required per cluster (infra and business): select the needed applications, add them to the files, and commit them to the GitOps repository. This makes the process of creating a new cluster, up to the point it is ready to serve traffic, extremely short (about one hour).

Single application – deploying a new application or upgrading an existing one across multiple clusters requires the same effort as with a single cluster.

The only action required to deploy the application is to create/update the Helm chart and set the correct chart revision and image tag in the cluster configuration file.
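
For example, a hypothetical bump of the authentication service in one cluster’s configuration file is a two-line change (the version and tag below are illustrative):

services:
  authentication:
    targetRevision: 0.3.57                                   # new chart revision
    imageTag: 1a2b3c4d5e6f7890abcdef1234567890abcdef12       # new image tag from CI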


Challenge #2 – Traffic management

We wanted to split traffic by percentage, while keeping it simple to add new clusters or to stop traffic to a cluster in a single action. We chose Cloudflare load balancer rules, which route traffic based on various criteria, such as directing a percentage of traffic to each cluster – while bearing in mind that modifying a system as sensitive as Cloudflare, especially during incidents, may increase the severity of an incident. As a solution, we imported all Cloudflare resources into Terraform, thereby reducing traffic splitting to a single source of truth in code.

locals {
  prod_use1_squirtle_ambassador_name = "prod-use1-squirtle-ambassador"
  prod_use1_snorlax_ambassador_name  = "prod-use1-snorlax-ambassador"
  prod_use1_staryu_ambassador_name   = "prod-use1-staryu-ambassador"

  traffic_split = {
    use1 = {
      squirtle = 0.33
      snorlax  = 0.33
      staryu   = 0.34
    }
  }
}

resource "cloudflare_load_balancer_pool" "prod_use1" {
  name = "prod-api"
  origins {
    name    = local.prod_use1_squirtle_ambassador_name
    address = data.aws_lb.prod_use1_monday_squirtle_ambassador.dns_name
    enabled = true
    weight  = local.traffic_split.use1.squirtle
  }
  origins {
    name    = local.prod_use1_snorlax_ambassador_name
    address = data.aws_lb.prod_use1_monday_snorlax_ambassador.dns_name
    enabled = true
    weight  = local.traffic_split.use1.snorlax
  }
  origins {
    name    = local.prod_use1_staryu_ambassador_name
    address = data.aws_lb.prod_use1_monday_staryu_ambassador.dns_name
    enabled = true
    weight  = local.traffic_split.use1.staryu
  }
}

### data

data "aws_lb" "prod_use1_monday_staryu_ambassador" {
  provider = aws.production-use1
  tags = {
    "ingress.k8s.aws/stack" = "ambassador-external/prod-use1-ambassador-external"
    "elbv2.k8s.aws/cluster" = "prod-use1-monday-staryu"
  }
}

data "aws_lb" "prod_use1_monday_snorlax_ambassador" {
  provider = aws.production-use1
  tags = {
    "ingress.k8s.aws/stack" = "ambassador-external/prod-use1-ambassador-external"
    "elbv2.k8s.aws/cluster" = "prod-use1-monday-snorlax"
  }
}

data "aws_lb" "prod_use1_monday_squirtle_ambassador" {
  provider = aws.production-use1
  tags = {
    "ingress.k8s.aws/stack" = "ambassador-external/prod-use1-ambassador-external"
    "elbv2.k8s.aws/cluster" = "prod-use1-monday-squirtle"
  }
}
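
For completeness, the pool is then attached to a Cloudflare load balancer. A minimal sketch is shown below – the hostname and zone reference are placeholders, and the exact arguments depend on the Cloudflare provider version:

resource "cloudflare_load_balancer" "prod_use1_api" {
  zone_id          = var.cloudflare_zone_id                       # placeholder
  name             = "api.example.com"                            # placeholder hostname
  default_pool_ids = [cloudflare_load_balancer_pool.prod_use1.id]
  fallback_pool_id = cloudflare_load_balancer_pool.prod_use1.id
  proxied          = true
}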

Challenge #3 – Autoscaling

We have a few clusters that serve the same customer base. Each cluster is independent in certain aspects, but always correlates to its traffic percentage. The monolithic cluster architecture served 100% of the traffic, and we invested immense thought into its autoscaling. Migrating to a multi-cluster architecture raised an additional challenge: each cluster began serving only part of the total traffic, making the previous optimization irrelevant.

As a solution, we created a formula based on the traffic percentage that each cluster receives. The formula bridges the gap between the current configuration and the cluster’s traffic percentage. Therefore, no matter how many clusters are added, we can ensure the environment stays cost-optimized and performance is upheld.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler

{{- /* $hpa, $defaultMinReplicas and $defaultMaxReplicas are defined elsewhere in the chart; this excerpt shows only the traffic-aware adjustment */}}
{{- $traffic_precentage := .Values.global.traffic_precentage }}
{{- if eq ( $traffic_precentage | quote ) "" }}
{{- $traffic_precentage = 100.0 }}
{{- end }}

{{- /* clusters receiving 1% of the traffic or less fall back to a single-replica floor */}}
{{- $defaultMinReplicas := ternary "1" $defaultMinReplicas (le ( $traffic_precentage | float64 ) 1.0 ) }}
{{- $defaultMaxReplicas := ternary "1" $defaultMaxReplicas (le ( $traffic_precentage | float64 ) 1.0 ) }}

{{- /* scale the chart's replica bounds by the cluster's share of traffic */}}
{{- $adjusted_min_replicas := max $defaultMinReplicas (div (mul $hpa.minReplicas ($traffic_precentage)) 100) }}
{{- $adjusted_max_replicas := max $defaultMaxReplicas (div (mul $hpa.maxReplicas ($traffic_precentage)) 100) }}
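
As a concrete, hypothetical illustration: for a service whose chart defines minReplicas: 30 and maxReplicas: 90, a cluster receiving 33% of the traffic would render (with integer division, and assuming the defaults are lower):

adjusted_min_replicas = max(defaultMin, (30 * 33) / 100) = 9
adjusted_max_replicas = max(defaultMax, (90 * 33) / 100) = 29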

Challenge #4 – Monitoring

One of the biggest challenges is changing the concept of monitoring within the team and across R&D as a whole. In a monolithic cluster architecture, we monitored the entire region by monitoring the single cluster. In a multi-cluster architecture, we monitor the region by aggregating metrics and statistics from all clusters, while still monitoring each cluster individually. As a solution, we used naming conventions and tags for almost every resource imaginable, enabling us to zoom in and out.

The following principles demonstrate how monitoring has been obtained:

  • Monitoring the health and performance of each individual cluster includes its resource usage, error rates, pod status, network traffic, and so on. By using generic alerts that are grouped by cluster, we are able to add as many clusters as needed without worrying about adjusting the alerts.
  • Centralized logging collects and aggregates logs from all clusters into one logging system, making it easy to search, analyze, and troubleshoot issues – at the cluster level as well.
  • Multi-cluster visualization was an important principle, even where it departed from the generic approach. We created a few golden dashboards, tailor-made per cluster, that provide a comprehensive view of the entire multi-cluster environment and its components.
  • Cross-cluster alerts are a key part, using the other clusters as a baseline. They help identify drift in deployment versions. For example, how can we ensure that, from the moment ArgoCD initiated a deployment in all regions and clusters, nothing failed, the rollout completed across the entire ecosystem, and customers are being served the exact same application version everywhere? We use Datadog to alert if a deployment does not have the same version (image tag) across all clusters for longer than two hours. This is done by taking the number of pods across the entire production fleet grouped by namespace, minus the number of pods grouped by version; if the totals are not equal, there is a version mismatch.


Bonus – Infrastructure releases and A/B testing

As a result of this project, we began to understand how essential multi-cluster is for infrastructure changes, providing us with broader abilities and higher limits – and this is just the beginning. In other words, we can execute canary infrastructure changes at the scope of a single cluster, in real time, while closely emulating the scale and behavior of the other clusters.

The flow: stop sending traffic to one cluster (Cloudflare), prepare the other clusters by changing their traffic percentages (which in turn adjusts their HPAs), make the necessary changes on the drained cluster, and then gradually move traffic back to the upgraded cluster. This is the game changer for us, transforming previously high-risk changes into safe, gradual releases.
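
In practice, draining a cluster is a one-line change to the traffic_split locals shown earlier (the redistribution below is illustrative):

  traffic_split = {
    use1 = {
      squirtle = 0.0    # drained for the infrastructure change
      snorlax  = 0.5
      staryu   = 0.5
    }
  }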

What about minor changes?

  • What if you want to understand whether your HPA is based on the right metric?
  • Which routing policy performs better in the reverse proxy?
  • What are the right thresholds for liveness and readiness probes?
  • Does the application really need those resources?
  • Is Karpenter better than the Cluster Autoscaler?

With the new infra release process, we define the right KPIs and collect and analyze metrics from each cluster. This helps us understand how each cluster performs and identify any potential issues by comparing metrics such as response time, throughput, and error rate.

This was an ongoing project for almost a year, and every step required thorough research and planning. At this stage, we are proud to deliver our customers a more resilient system at the exact same cost, along with many promising changes to come that will inevitably depend on the multi-cluster architecture and its abilities.