
From Firefighting to Fireproofing Production
A little over three months ago, I had the pleasure of joining monday.com to help with the challenge of bulletproofing its production systems. Today, I would like to take you, my dear reader, into the world of reliability engineering, with a focus on systems running in production.
monday.com is no longer a small startup these days; it’s a powerful scale-up running the businesses of ~245,000 customers worldwide, and it needs to think ahead and plan its actions to ensure reliability is a first-class citizen across all of its products.
The Need for Auditing Production Systems
As you most probably know, there’s an excellent analogy for Site Reliability Engineers: firefighters. Firefighters don’t only do reactive work when a fire breaks out. They also plan ahead and audit construction projects in advance, eliminating specific causes of fire and minimizing risk. It’s proactive work that is extremely powerful.
However, to audit effectively, you need a clear set of safety standards, essentially our engineering building codes. By the time I joined, a cross-team effort between Reliability and R&D Foundations had already produced brilliant work in establishing Production Principles, which now serve as the guiding North Star for the engineering teams. These principles are high-level fundamentals covering reliability and resiliency, security, performance, observability, and more; any change we make should adhere to them, because they describe how our systems should operate in production.

The team had also already defined the Production Readiness Checklist (PRC), written “from engineers to engineers.” This framework was used to assess the production readiness of a given service: it helped ask the right questions, find the relevant answers, and declare how ready the system was. For instance, the questions may have been:
- Does it use Canary Releases?
- Does it rate-limit all entry points?
- Does it validate schemas on all entry points?
- Does it use the recommended observability toolkit?
Each engineering team also had agreed-upon definition documents explaining what each of these items meant. Every item had to be concrete (clearly outlining what service owners were expected to do), actionable, and fully aligned between the infrastructure and development teams on what was essential to achieve operational excellence and reliability.
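To make that concrete, here is a minimal sketch of how such a checklist item could be modeled. This is my own illustration, not monday.com’s actual schema; every name, field, and URL below is hypothetical.

```typescript
// Hypothetical model of a Production Readiness Checklist item.
// Field names, values, and the URL are illustrative, not monday.com's schema.

type PrcStatus = "pass" | "fail" | "not-applicable" | "unknown";

interface ServiceDescriptor {
  name: string;
  owners: string[];
  deployment: { strategy: "canary" | "rolling" | "recreate" };
}

interface PrcItem {
  id: string;              // e.g. "canary-releases"
  question: string;        // the question as engineers phrased it
  definitionDoc: string;   // link to the team's agreed-upon definition document
  remediation: string;     // what a service owner is expected to do to pass
  evaluate: (service: ServiceDescriptor) => Promise<PrcStatus>;
}

// Example item: "Does it use Canary Releases?"
const canaryReleases: PrcItem = {
  id: "canary-releases",
  question: "Does the service use canary releases?",
  definitionDoc: "https://wiki.internal.example/prc/canary-releases",
  remediation: "Switch the service's rollout strategy to canary in its deployment config.",
  evaluate: async (service) =>
    service.deployment.strategy === "canary" ? "pass" : "fail",
};
```

Keeping the definition, the expected action, and an executable evaluation in one place is what makes an item both concrete and, later on, automatable.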
How monday.com Takes a Preventive Approach
The next major initiative that monday.com ran is called the Production Readiness Review (PRR). If the PRC is the framework, then the review is… the firefighters’ audit. Initially it was a manual process, which let us learn how it worked and how effective it was, because “Done is better than perfect.” That’s my favorite attitude in cases like this. For a kickoff it’s excellent, but with a growing company, more services, and more teams, it might not be enough.
This is where we began implementing automations built on Sphera, our Internal Developer Portal (IDP), to help teams identify risks in real time within the service catalog. Instead of stopping everything and rushing into a new project, we automated the checks one by one, step by step, to reduce toil and provide data-driven information.
My first onboarding task at monday.com is a good example of the effort and manual work required to keep production systems from failing, and of what these checks replace. The task was to implement a check called “Connection Over-Provisioning”, which is valuable and crucial for an environment like monday.com’s. For context:
- The services: We have over 400 microservices running in production on Kubernetes with horizontal autoscaling (adding new instances when traffic spikes).
- The pods: Each pod has a configurable maximum connection pool size (how many connections one instance can open).
- The database: Each database has a hard limit on the maximum number of connections it can accept.
There are, of course, more complex scenarios involving proxies and N read replicas, but let’s limit this example to the simple single-database case.
Manually validating this is a nightmare. You must constantly calculate whether (Max Replicas × Connection Pool Size) exceeds the Database Max Connections. It’s impossible with hundreds of services.
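To make the math tangible with made-up numbers: a service whose HPA allows up to 40 replicas, each with a connection pool of 20, can open 40 × 20 = 800 connections at peak, which would exhaust a database capped at, say, 500 connections, even though everything looks healthy on a quiet day.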
The Solution
To address this, we implemented an automated PRC that runs in real-time. Instead of relying on human memory, the script dynamically fetches the HPA settings and connection pool configuration. It calculates the maximum number of theoretical connections and compares it with the live database limits for a given service. If the math predicts potential exhaustion, the check fails, showing a real risk.
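Here is a minimal sketch of that logic, assuming hypothetical helper functions in place of the real Kubernetes, configuration, and database integrations; it illustrates the idea rather than the production implementation.

```typescript
// Sketch of the Connection Over-Provisioning check.
// fetchHpa, fetchServiceConfig, and fetchDbLimits are hypothetical stand-ins
// for the real HPA, service-config, and database integrations.

interface CheckResult {
  passed: boolean;
  details: string;
}

async function fetchHpa(service: string): Promise<{ maxReplicas: number }> {
  return { maxReplicas: 40 };        // would query the Kubernetes API in reality
}

async function fetchServiceConfig(service: string): Promise<{ connectionPoolSize: number }> {
  return { connectionPoolSize: 20 }; // would read the service's live configuration
}

async function fetchDbLimits(service: string): Promise<{ maxConnections: number }> {
  return { maxConnections: 500 };    // would query the database's connection limit
}

async function checkConnectionOverProvisioning(service: string): Promise<CheckResult> {
  const hpa = await fetchHpa(service);
  const config = await fetchServiceConfig(service);
  const db = await fetchDbLimits(service);

  // Worst case: every replica opens its full connection pool.
  const theoreticalMax = hpa.maxReplicas * config.connectionPoolSize;

  if (theoreticalMax > db.maxConnections) {
    return {
      passed: false,
      details:
        `${service} can open up to ${theoreticalMax} connections ` +
        `(${hpa.maxReplicas} replicas × ${config.connectionPoolSize} per pod), ` +
        `but the database accepts only ${db.maxConnections}.`,
    };
  }
  return { passed: true, details: `${service} stays within its database connection limits.` };
}
```

Because the check re-reads the live HPA and database settings on every run, it keeps passing or failing correctly even as teams tune their autoscaling.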
That’s why automations here are game-changing: they reduce toil (manual, repetitive work) and provide proper visibility into potential issues, without requiring anyone to continuously review constantly changing systems.
The Service Scorecards
To complete the experience, the Developer Experience (DevEx) team crafted a framework to integrate these automations directly into Sphera, introducing the concept of Service Scorecards. The idea was simple: whenever service owners or leadership visited a service catalog page, they were immediately drawn to the potential risks associated with that service. The scorecards served two different audience views (a small sketch follows the list below):
- Software Engineers get technical clarity: They understand exactly what is wrong (for instance: “Your connection pool is too high for your max replicas”) and how to fix it.
- Leaders get a risk assessment: They can see if their groups are safe and protected against known issues.
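As a rough illustration, a single scorecard entry might carry both views at once. This is my own sketch, not Sphera’s actual data model; all names and values are hypothetical.

```typescript
// Hypothetical scorecard entry carrying both audience views.
// The shape and values are illustrative, not Sphera's actual data model.

interface ScorecardEntry {
  checkId: string;
  passed: boolean;
  // For engineers: what exactly is wrong and how to fix it.
  engineerMessage: string;
  remediationHint: string;
  // For leadership: an aggregated risk signal for the group's services.
  riskLevel: "low" | "medium" | "high";
}

const exampleEntry: ScorecardEntry = {
  checkId: "connection-over-provisioning",
  passed: false,
  engineerMessage: "Your connection pool is too high for your max replicas.",
  remediationHint: "Lower the pool size, lower max replicas, or raise the database connection limit.",
  riskLevel: "high",
};
```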

Over time, the reliability team also noticed a need for deeper check tooling. For instance, answering PRC points about simple things based on Datadog metrics was relatively easy, but how could they verify whether a service had undergone a Chaos Engineering session in the last quarter or two? Or whether a team was running too many manual operations on a given service?
We had a lot of questions like these, so we went to different groups to ask about teams’ toil, pain points, and risks within their given scopes (following Production Principles). Based on this, we began developing a vision for the solution, particularly for complex production checks.
The North Star: A Proactive Ecosystem
As we move forward, we are connecting all the puzzle pieces (Principles, Checklists, and Scorecards) into a seamless, automated workflow. Our goal is to minimize risk not by slowing down, but by helping the other engineering teams at every step.
This vision, illustrated below, relies on two key engines to speed up safe decision-making. Both are in-progress initiatives with enormous potential.
- Genesis (AI-Assisted Remediation in the IDP): We are transitioning from simply flagging risks to addressing and resolving them. Instead of just alerting on an unsupported Node version, Genesis offers a “Fix” button that automatically opens a Pull Request.
- Proactive Reliability Guardrails: We plan to integrate the Production Readiness Checklist into the code review process; some validations based on the PRC are already in place, running, and protecting the resiliency of monday.com’s systems. Complex checks, such as the Connection Over-Provisioning logic, run automatically in the CI pipeline based on the files that were changed (see the sketch right after this list). This saves the reviewer from manual math and prevents architectural issues from ever being introduced into the main branch.
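To give a feel for what such a guardrail could look like, here is a hypothetical sketch of a CI step that only runs the checks relevant to the changed files. The file patterns, check wiring, and the stubbed check function are assumptions for illustration, not the actual pipeline.

```typescript
// Hypothetical CI guardrail: run PRC checks only when relevant files change.
// Path patterns, names, and the stubbed check are illustrative assumptions.

interface GuardrailResult { passed: boolean; details: string }
type GuardrailCheck = (service: string) => Promise<GuardrailResult>;

// Stand-in for the Connection Over-Provisioning logic sketched earlier.
const checkConnectionOverProvisioning: GuardrailCheck = async (service) => ({
  passed: false,
  details: `${service} can exceed its database connection limit at max scale.`,
});

// Map file patterns to the checks they should trigger.
const guardrails: Array<{ name: string; pattern: RegExp; check: GuardrailCheck }> = [
  {
    name: "connection-over-provisioning",
    pattern: /(hpa|autoscaling|database)\.ya?ml$/, // only re-run the math when scaling or DB config changes
    check: checkConnectionOverProvisioning,
  },
];

async function runGuardrails(service: string, changedFiles: string[]): Promise<boolean> {
  const triggered = guardrails.filter((g) => changedFiles.some((f) => g.pattern.test(f)));
  let allPassed = true;
  for (const { name, check } of triggered) {
    const result = await check(service);
    console.log(`[${name}] ${result.passed ? "PASS" : "FAIL"}: ${result.details}`);
    if (!result.passed) allPassed = false;
  }
  return allPassed; // a failing guardrail blocks the merge into the main branch
}
```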

Everything here is designed to minimize risk and prevent issues before they can occur in our production systems, and to let us make safe, reliable decisions as fast as possible.
You Can’t Automate the Culture
Ultimately, all of those automations and AI solutions serve as enablers for the engineering teams. The most significant impact, though, is the change in mindset. Transitioning from reactive work to proactive work, and then to preventive work, is about creating an environment where engineers and leaders can rest well, without worrying about unexpected PagerDuty alerts disrupting their sleep in the middle of the night.
When I joined monday.com, I saw an R&D team ready to adopt these principles, not because they were forced to do so, but because they fully understood that reliability allows them to move faster and be more innovative, creating the best experience for their customers.
My biggest takeaway from this? You can automate the math and complex engineering problems, but you cannot automate the culture that has to be created, one principle at a time.


