When Twenty-Five Deployments A Day Are Simply Not Enough
As our R&D organization grew, the more features we added and the longer our release backlog became, the more we had to adapt the way we deliver code to production. Our delivery strategy had to be stable, frequent, and automated.
That was the point when we had to rethink our CI/CD process and redesign it for this new reality. We decided to implement deployment trains.
But first, let’s understand the evolution of our CI/CD.
At first, when we were only a few developers with no infra team to manage the deployment process, developers would deploy manually by running a deployment script on their laptops.
While our developers enjoyed having the power to release at any given time, we had a few problems using this strategy:
- Lack of transparency: except for the developer running the deployment, no one knew its status. Was it currently running? Had it failed?
- Developers deployed whatever their machine considered “master” at that point, instead of going through the pull request, code review, and merge procedures.
- Tests could be bypassed.
- Running the deployment process simultaneously from multiple machines could set our production to an undefined state.
It’s a very common practice for small R&D teams, and it allows the team to focus on writing code rather than process, but at the cost of a very fragile and complicated deployment.
So, we deployed manually for a while and it was pretty “funny” to see a developer sweating and forcing their laptop to stay awake for the process to work.
This is a member of our infra team
Although this worked for a while, we needed a process that would grow with the company: the more developers we had, the more frequently we wanted to deploy, and manual deployments became increasingly risky.
At this point, we started to think about automating the process, and we implemented a CI/CD pipeline.
We weighed our options for redesigning the deployment strategy and eventually decided on a Blue/Green deployment: with every new version, we’d create a completely new instance group, wait for the instances to build and pass a health check and basic functionality tests, and once they were ready, point our load balancer at the new instances.
The old instances would then become our rollback environment; by keeping those servers alive, we could roll back our changes very quickly when needed.
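To make the flow concrete, here is a minimal sketch of that switch-over logic in Python. The instance-group and load-balancer calls are hypothetical stand-ins passed in as parameters, not our actual tooling or any real cloud SDK.

```python
def blue_green_deploy(lb, current_group, new_version,
                      create_instance_group, health_check, smoke_test):
    """Bring up a new ('green') instance group, verify it, then switch traffic.

    Returns the old ('blue') group, which is kept alive as the rollback
    environment. All callables here are illustrative stand-ins.
    """
    green = create_instance_group(new_version)      # build the new instances
    if not (health_check(green) and smoke_test(green)):
        # Never switch traffic to unverified instances.
        raise RuntimeError("new instances failed verification")
    lb.set_targets(green)        # point the load balancer at the new group
    return current_group         # old group becomes the rollback environment
```

The key property is that traffic only moves after verification passes, and the previous group survives the deploy, which is what makes the quick rollback possible.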
This process worked very well: it was fully automated and stable, and we could finally deploy frequently.
As we kept on growing, we started to see the downside of the process:
The first problem was that we could roll back only one version. If we had already deployed twice between the introduction of a bug and its discovery, the rollback environment no longer held a good version, so the rollback became unusable.
The second problem was that most production deployments contained only a single merge to master, and it wasn’t transparent enough to developers when, and by whom, code had been merged.
So, de facto, a deployment to production could happen at any given time, and developers weren’t always aware when their code was released.
Given what we’d learned about the automated deployment process, we had to change the way we deploy while making the rollback process more resilient. This led us to eventually choose to work with deployment trains.
We decided that instead of requiring manual approval for deployments, we would schedule automatic deployments a few times per day, with each deployment pushing to production every merge to master since the last deployment.
Deploy Train Workflow
The first step toward that new strategy was to explore GitHub and learn our developers’ behavior during the day. We analyzed merge patterns by time of day and chose the train hours based on that data.
So now we had the hours and an automatic deployment train: each train leaves with at least one commit to master, and sometimes as many as 20 at a time. The fixed schedule ensures that we have sufficient time to respond and use the rollback process if needed.
With these problems behind us, we were still facing issues of communication and visibility for our developers about the deployment process in real time. So we took it a step further and created a train dashboard, which shows a countdown to the next train’s departure, the schedule of upcoming trains, which commits are waiting for deployment, and when each train leaves. We also show, in real time, the stages of deployment: tests, build, deploy, and teardown.
Deployment train dashboard
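The countdown on such a dashboard boils down to finding the next scheduled departure after the current time. A small Python sketch, assuming (as an illustrative simplification) that trains leave exactly on the hour:

```python
from datetime import datetime, timedelta

def next_departure(now, schedule_hours):
    """Return the next train departure at one of the scheduled hours."""
    for hour in sorted(schedule_hours):
        candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate > now:
            return candidate
    # No train left today: the next one is tomorrow's first train.
    first = min(schedule_hours)
    tomorrow = now + timedelta(days=1)
    return tomorrow.replace(hour=first, minute=0, second=0, microsecond=0)

def countdown(now, schedule_hours):
    """Time remaining until the next train, for the dashboard timer."""
    return next_departure(now, schedule_hours) - now
```

Rendering is then just formatting this `timedelta` and refreshing it periodically on the page.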
Deployment trains are great, and if you decide to implement one, follow these two points as general guidance:
- Error handling during the deployment stages is very important; you want to be on top of every failure in that process.
- Make visibility and communication your top priority; it will save you a lot of time when developers can see all the data they need.
Found this interesting? Want to join us?
We build our infrastructure solutions the same way we build everything at monday.com. We’re looking specifically for full-stack developers with a special passion for infrastructure. Like any engineers at monday.com, if we need to change the application code to comply with infrastructure changes, we do it ourselves rather than waiting for someone else to do it.
If you’re a team player with strong communication skills, this just may be the team for you. You can see here all of our team’s open positions, which include Development Experience Engineer, Infrastructure Engineer (SRE), Infrastructure Backend Engineer, Production Engineer, and DBA.
If you’re excited by what you just read and our challenges, we will be happy to meet and share the knowledge. Let’s be in touch! 🙂