ArgoCD diffs at scale

Michał Plebański

Nov 28, 2025 · 7 min read

GitOps Overview

Today we’ll take a closer look at how we at monday.com manage Kubernetes cluster state with GitOps and ArgoCD, and what the change process looks like, especially when it comes to reviewing change requests.

GitOps is a framework for declaratively defining the desired state in a Git repository. This state, in our case the cluster state, is fetched and applied to the targets at regular intervals by a service called ArgoCD.

ArgoCD offers multiple ways to define sources for the desired state: plain Kubernetes manifests, Helm charts, Kustomize, Jsonnet, and more. Most of our state sources are defined as Helm charts with corresponding Helm configuration values.

To keep everything DRY (don’t repeat yourself), we use a hierarchy of configuration files (overlays) that allow us to control resources at different levels of specificity, e.g., region-wide or environment-wide.
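To make the overlay idea concrete, here is a minimal sketch of how layered values could be merged from least to most specific. The layer names, keys, and merge rules (nested maps merged, everything else replaced by the more specific layer) are assumptions for illustration, not a description of our actual repository layout.

```python
from copy import deepcopy


def merge_overlays(*layers):
    """Deep-merge dicts in order; later (more specific) layers win."""
    result = {}
    for layer in layers:
        _merge_into(result, layer)
    return result


def _merge_into(base, override):
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            # Both sides are maps: merge them key by key.
            _merge_into(base[key], value)
        else:
            # Scalars and lists are replaced wholesale by the override.
            base[key] = deepcopy(value)


# Hypothetical overlays, ordered from least to most specific.
global_values = {"replicas": 2, "resources": {"cpu": "500m", "memory": "512Mi"}}
region_values = {"resources": {"memory": "1Gi"}}
env_values = {"replicas": 5}

merged = merge_overlays(global_values, region_values, env_values)
print(merged)
```

A one-line change in the least specific layer flows into every merged result that does not override it, which is exactly why such a change is hard to review from the raw file diff alone.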

The good, the bad, and the ugly

Hierarchical overlays are a flexible abstraction: they allow us to modify resources at scale and stay DRY, but they also pose a risk, for multiple reasons:

Large blast radius – modifying the least specific configuration values can affect a large number of resources across multiple environments
Difficult to understand – hard to visually determine the merged result of overlays and therefore the final applied manifest
Onboarding difficulty – freshly onboarded developers don’t have confidence in the changes being introduced

Many critical development paths in our GitOps repository that face these problems have already been mitigated by automation: bots from, e.g., CD (continuous deployment) pipelines or developer portal backends already make changes to the state, and developers interact with them only through user-friendly UIs. Changes made by bots are smaller, less error-prone, and done in steps with proper validation checks in between.

Yet there are still paths not covered by automation, and for those we decided that we needed a proper diffing mechanism: a better way to understand what the final applied state will be.

Addressing the root cause

An important question pops up immediately: should we move to the rendered manifests pattern instead? The rendered manifests pattern means pre-rendering all your Helm charts, storing them in the Git repository, and letting ArgoCD sync the result. Then what you see in Git (and in the PR diff) is what you get in the Kubernetes cluster.

That would address the root cause, reduce the load on ArgoCD and make the changes more explicit.

Long story short, the main reason we decided not to go this way is the significant migration effort. We are heavily committed to the current structure: we have a lot of tooling and automation around it, and a large number of critical apps would have to be migrated to the new source definitions, which introduces incident risk.

One more reason was that the pull request UI is not suitable for browsing diffs at our scale. With many resource changes across many clusters, we saw a need to manipulate the diff view, e.g., to group resource changes by cluster, environment, or similarity (diff hash).

Our approach

Manual changes to the GitOps state are made by opening a pull request, so the ideal flow we envisioned is to open a new pull request and browse the diff within its context.

The approach we chose is to render the manifests on the fly in the CI system for both the target and head branches, and to compare the two to generate a diff artifact that is later displayed in a dedicated UI. A link to this UI is posted as a comment on the pull request thread. Users then follow the link and approve the diff in the UI.

There are three components to the solution:

Render CLI
Backend for storing and approving diff artifacts
UI for diff browsing

The most important part is how to render the manifests. There are a few constraints that we wanted to enforce:

High accuracy – include Kubernetes cluster version and capabilities (API versions) in the rendering process
Reasonable render time
No additional load on existing ArgoCD instances
Ability to test changes to the Helm chart templates themselves – an option to locally override the fetching logic to use a local chart (instead of a chart museum or VCS)
Support for custom CRDs

We also concluded that the diff should compare the current desired state with the new desired state, and not with the live state (using, e.g., the “argocd app diff” command), because that would require runtime access to many ArgoCD instances. Imagine fetching the state of hundreds of apps on each pull request, with network errors, broken apps, in-progress syncs, etc. The result would often be hard to understand, with unpredictable final content and a very long render time. Tackling the drift between the live and the desired state is a problem for a different set of tools and processes.

Initially, we tried an approach based on spinning up an ArgoCD instance with kind (Kubernetes in Docker) inside the CI system and handing the rendering over to it, but that turned out to be too slow and too hard to set up for local overrides: we would need to modify the manifests on the fly to be able to override the source URLs. Our apps have many sources at different revisions (the time to fetch each of them adds up quickly), they run across many clusters, and the clusters have different versions and capabilities.

We decided to develop a custom rendering tool on top of the core “helm template” command, with behavior that resembles ArgoCD’s as closely as possible.

The basic algorithm is as follows: take two repositories, A and B (think of them as the head and target branches of the same GitOps repo), and render manifests for each as long as there are ArgoCD app manifests in the render queue. Once the queue is empty, build the diff artifact by comparing the resources one by one, making sure to ignore any fields specified in each app’s “ignore differences” section.
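The algorithm can be sketched roughly as follows. This is a simplified illustration, not our actual tool: the function names are hypothetical, the renderer is a stand-in passed as a callback, and ignored fields are given as one flat list of JSON pointers (the format ArgoCD accepts under an app’s ignoreDifferences jsonPointers) rather than per app.

```python
from collections import deque
from copy import deepcopy


def resource_key(res):
    meta = res["metadata"]
    return (res["kind"], meta.get("namespace", ""), meta["name"])


def render_all(root_apps, render_app):
    """Drain the render queue; rendered Application resources are queued too."""
    queue = deque(root_apps)
    rendered = {}
    while queue:
        app = queue.popleft()
        for res in render_app(app):
            rendered[resource_key(res)] = res
            if res["kind"] == "Application":
                queue.append(res)
    return rendered


def drop_pointer(res, pointer):
    """Remove the field at a JSON pointer (no ~0/~1 escaping in this sketch)."""
    parts = pointer.strip("/").split("/")
    node = res
    for part in parts[:-1]:
        node = node.get(part)
        if not isinstance(node, dict):
            return
    node.pop(parts[-1], None)


def build_diff(old, new, ignored_pointers=()):
    """Compare resources one by one, skipping the ignored fields."""
    def clean(res):
        if res is None:
            return None
        res = deepcopy(res)
        for pointer in ignored_pointers:
            drop_pointer(res, pointer)
        return res

    diff = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = clean(old.get(key)), clean(new.get(key))
        if before != after:
            diff[key] = (before, after)
    return diff


def fake_renderer(replicas, image):
    """Stand-in for the real renderer: a root app that yields one child app."""
    def render_app(app):
        if app["metadata"]["name"] == "root":
            return [{"kind": "Application",
                     "metadata": {"name": "api", "namespace": "argocd"}}]
        return [{"kind": "Deployment",
                 "metadata": {"name": "api", "namespace": "prod"},
                 "spec": {"replicas": replicas, "image": image}}]
    return render_app


root = [{"kind": "Application", "metadata": {"name": "root", "namespace": "argocd"}}]
target = render_all(root, fake_renderer(2, "api:1.0"))
head = render_all(root, fake_renderer(3, "api:1.1"))
diff = build_diff(target, head, ignored_pointers=["/spec/replicas"])
print(list(diff))
```

In this toy run only the Deployment shows up in the diff, and only because of the image change: the replica count differs between the two branches but is ignored.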

The tool makes heavy use of caching for the VCS repos, fetches the versions and capabilities of all Kubernetes clusters ahead of time from real ArgoCD instances, and has a mechanism for overriding VCS and chart museum sources with local content.
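As an illustration of how cluster versions, capabilities, and local overrides could feed into rendering, the sketch below only assembles a “helm template” argument list and does not run Helm. The flags used (--namespace, --kube-version, --api-versions, --values, --include-crds) are regular “helm template” flags; the function, its parameters, the chart reference, and the override map are hypothetical and not taken from our tool.

```python
def helm_template_cmd(release, chart, namespace, kube_version,
                      api_versions, values_files, local_overrides=None):
    """Build the argv for one `helm template` call (hypothetical helper)."""
    # Swap the remote chart reference for a local directory if one is given.
    chart = (local_overrides or {}).get(chart, chart)
    cmd = [
        "helm", "template", release, chart,
        "--namespace", namespace,
        "--kube-version", kube_version,  # target cluster version
        "--include-crds",
    ]
    for api in api_versions:             # capabilities of the target cluster
        cmd += ["--api-versions", api]
    for path in values_files:            # overlays, least specific first
        cmd += ["--values", path]
    return cmd


cmd = helm_template_cmd(
    release="api",
    chart="oci://registry.example.com/charts/api",
    namespace="prod",
    kube_version="1.31.0",
    api_versions=["monitoring.coreos.com/v1", "keda.sh/v1alpha1"],
    values_files=["values/global.yaml", "values/prod.yaml"],
    local_overrides={"oci://registry.example.com/charts/api": "./charts/api"},
)
print(" ".join(cmd))
```

Passing the cluster version and API versions explicitly matters for templates that branch on them through Helm’s built-in Capabilities object, so the same chart can legitimately render differently per cluster.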

Once the diff artifact is ready, we upload it to the diff service backend and redirect the author via a link in the pull request comment to the dedicated frontend. The UI has the ability to search, sort and group resources.

The group-by functionality is where the UI really shines and where it fits a larger scale, e.g.:

group by an environment label to make sure that you only modify pre-prod resources,
group by cluster to make sure you only changed the clusters that you wanted to,
or group by the same diff hash to make sure that the label you are changing is really the only change across all modified resources.

The UI allows the user to define a grouping on any label defined on the resource and by the most common properties of Kubernetes resources e.g. Kind, ApiVersion or Name.
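Grouping by diff hash can be pictured like this. The sketch assumes each diff entry carries the list of field-level changes without the resource identity, so that identical changes on different resources hash to the same value; the entry shape, field names, and hashing scheme are illustrative assumptions, not the format of our diff artifact.

```python
import hashlib
import json
from collections import defaultdict


def diff_hash(change):
    """Stable short hash of a change, independent of which resource it hits."""
    payload = json.dumps(change, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def group_by(entries, key_fn):
    groups = defaultdict(list)
    for entry in entries:
        groups[key_fn(entry)].append(entry["name"])
    return dict(groups)


# Hypothetical diff entries for three changed resources.
label_change = [{"path": "/metadata/labels/team", "old": "infra", "new": "platform"}]
entries = [
    {"name": "api", "cluster": "eu-1", "change": label_change},
    {"name": "worker", "cluster": "us-1", "change": label_change},
    {"name": "cron", "cluster": "us-1",
     "change": [{"path": "/spec/replicas", "old": 2, "new": 3}]},
]

by_hash = group_by(entries, lambda e: diff_hash(e["change"]))
by_cluster = group_by(entries, lambda e: e["cluster"])
print(by_cluster)
for digest, names in by_hash.items():
    print(digest, names)
```

If a pull request is meant to change a single label everywhere, the grouping should collapse to one hash; here the second group immediately exposes the unintended replica change on “cron”.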

Summary

The good part is that the solution is fast and accurate: it uses real Kubernetes cluster capabilities, follows ArgoCD’s rendering behavior as closely as possible, and offers a better review experience than the default pull request UI. We make heavy use of local overrides to see how changes to custom-developed Helm charts would affect existing apps.

Overall the tool contributed to:

Lower incident risk – due to a better change visibility
Better productivity – it’s being used to shift left the validation of the templating logic: if it’s not able to render the diff artifact, the change is blocked
Faster onboarding – new users only need to understand the Kubernetes resources, not the hierarchical abstraction and its intricacies

Yet custom tools come at a price. The downside of the solution is that it is vulnerable to any change in the rendering logic of core ArgoCD, which might cause its results to drift. We might mitigate this risk in the future by extracting the ArgoCD modules responsible for rendering directly from the service, instead of recreating them “outside” of it.
