How we manage software infrastructure at monday.com
As you may know, at monday.com, we’re transforming the way people work by building simple, intuitive SaaS software that connects teams around the world to their workplace processes, while improving collaboration and communication along the way.
Imagine you’re running a super successful company with 10 employees, 30 different clients, and 50 projects. As a manager, you decide to run your entire business (external projects and internal communication) on monday.com. An excellent choice if we do say so ourselves! 😉 monday.com then becomes the core of your company’s operations. While this is just one example, it is the reality for the 35,000+ teams that rely on our platform. As a result, we at monday.com simply cannot afford outages or system downtime. So, we’re always working as hard as possible to provide the best user experience we can and to keep the service up.
Now a little about our R&D department, which is responsible for building this platform. As of now, we’re around 30 engineers, split into a few teams. All our engineers are true full-stack engineers. And when we say full stack, we don’t just mean that they write backend and frontend code. We mean that they own a concept end to end: thinking it through (together with product and UX experts), executing it (by writing the required code), and then deploying it (creating the infrastructure and monitoring). As a full-stack engineer at monday.com, you own the whole process. We want to maintain that culture for as long as we can.
At monday.com, we’re obsessed with transparency, which helps us operate everything in a more visible way. When you visit the monday.com HQ, you will notice a lot of screens (over 70 to date!) showing data on dashboards. Many of these dashboards are unique to R&D. They show how many deployments we did each day, who is deploying right now, what’s failing, what our current test coverage is, and much more.
We don’t have a formal QA process. Instead, we have a good suite of tests that allows us to safely do Continuous Delivery, with about 20–30 deployments per day. We also have a proprietary A/B testing framework, backed by BigBrain, which further increases our confidence in making so many changes per day.
Although all engineers are capable of understanding and applying infrastructure changes at any given point, we have a dedicated infrastructure team that focuses on providing the right tools to maximize developer productivity and to build a framework to make infrastructure changes possible for any engineer across the company.
We have a 24-hour on-call rotation shared by all engineers. This means that each engineer must be capable of resolving any issue, whether it’s a bug in the code or an infrastructure change that needs to be fixed. We invest a lot of time in education and training, and we maintain a wiki that covers different types of incidents and how to solve them.
The Infrastructure Team’s Main KPIs
Our infrastructure team is working to keep our developers focused on the work that really makes an impact for the company. It is the infrastructure team’s responsibility to ensure our developers waste as little time as possible creating a development environment, debugging, running their builds, or deploying changes to production. In other words, the team makes our engineers happy and allows them to work on moving the company forward at such a super fast pace.
We created a Docker (and Docker Compose) based environment that lets engineers spin up the whole environment on their own laptops for development and testing. We also continuously measure our build and deployment pipelines and work to make them faster. You can take a look at some interesting metrics in the dashboard below:
The infrastructure team works on determining the best way to develop the monday.com code, which components we should be using, how we share code between services, and how we share infrastructure between services and teams. We recently created a shared module of our own best practices to build a service. This module can be reused in all services we are developing. It includes our linter conventions, monitoring, logging, secret sharing, how we build new projects, test new projects and deploy new code.
Production Stability & Security
The infrastructure team also invests heavily in production uptime, capacity planning, service resilience, performance testing, and security.
We manage all production incidents in a dedicated monday.com board. Each time we detect an incident, we record it in this board. The information we track includes the incident title, who took care of it, the date it happened, time to resolution, root cause, which service was affected, incident severity, current status, and more. This allows us to keep track of production incidents, implement action items, and learn from our mistakes.
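To make the fields above concrete, here is a minimal sketch of what one incident record might look like if you modeled it in code. This is only an illustration: the class name, field names, severity levels, and the sample data are all assumptions made for the example, not the actual structure of our board.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Hypothetical severity levels; the real board may use different ones.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Incident:
    """One row in an incident board (all field names are illustrative)."""
    title: str
    owner: str                      # who took care of the incident
    detected_at: datetime           # when it happened
    affected_service: str
    severity: Severity
    status: str = "open"
    root_cause: Optional[str] = None
    resolved_at: Optional[datetime] = None

    def resolve(self, resolved_at: datetime, root_cause: str) -> None:
        """Mark the incident as resolved and record what caused it."""
        self.resolved_at = resolved_at
        self.root_cause = root_cause
        self.status = "resolved"

    @property
    def time_to_resolution(self) -> Optional[timedelta]:
        """Elapsed time from detection to resolution, or None if still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.detected_at


# Example usage with made-up data.
incident = Incident(
    title="Elevated error rate on login",
    owner="on-call engineer",
    detected_at=datetime(2019, 1, 10, 9, 0),
    affected_service="auth",
    severity=Severity.HIGH,
)
incident.resolve(datetime(2019, 1, 10, 9, 45), root_cause="bad config rollout")
print(incident.status, incident.time_to_resolution)
```

Deriving time to resolution from the two timestamps, rather than storing it as a separate field, keeps the record from drifting out of sync when someone corrects a timestamp later.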
You can see this example in its entirety in one of our newest offerings, monday stories.
So, what kind of people are we looking for to join our infrastructure team?
We build our infrastructure solutions in the same way we build everything at monday.com. We’re looking specifically for full-stack developers with a special passion for infrastructure. Like any engineer at monday.com, if we need to change the application code to comply with infrastructure changes, we do it ourselves without waiting for someone to do it for us.
If you’re a team player with strong communication skills, this may be just the team for you. You can see the full job description for our current open role of Infrastructure Engineer (SRE) here. We are always looking for new team members, and if you’re excited by what you just read, we’d love to meet you!