Performance tasks and how to tackle them
The goal of this post is to share a bit of the knowledge we’ve acquired at monday.com on improving server side performance, so I’ll give a bit of context about myself and monday.com and then we’ll dive in.
I’m leading the new server foundations team at monday.com. The team has a lot of responsibilities, one of which is performance improvements.
In recent years we have grown rapidly, with a year over year growth of 300%. This growth in the customer base came with a massive growth in the complexity of the product itself, creating some extreme scale challenges. You know, the fun kind 🙂
There were times when it felt like each day brought a new crisis. There was so much to do, and it was hard to decide where to start and what to focus on. We felt that we had the skills to solve the problems, but had difficulty identifying and prioritizing them.
In this post I want to talk about how to tackle these challenges, focusing on the high level considerations needed, how to prioritize and what to measure.
First things first, how do we measure performance
At monday.com we believe in working with a data driven approach. You should be able to measure what you want to improve, set KPIs and improve them over time.
There are a lot of tools for measuring server performance, so it’s important to choose and implement the ones that meet your needs.
We use New Relic as our main APM solution, getting an overall view of the system, measuring specific endpoints and monitoring database usage.
We also use VividCortex for monitoring our main SQL database and for query analytics. Our production environment is deployed on AWS and we use most of the built in tools they offer, like CloudWatch and others. We also have a custom solution we built ourselves that measures specific user actions.
After you know what to measure, make it accessible. Add alerts and dashboards to view your main KPIs.
Now that we can measure, let’s talk about what we want to measure.
Measuring your app’s average response time
This is an easy to measure bottom line metric that’s affected by all parts of your app. Like the name suggests, it’s the average for your entire system: the average response time of each endpoint, weighted by that endpoint’s throughput.
When your app hits its limits your average response time will usually rise considerably so it’s worth monitoring.
The problem is that it’s not always easy to improve. You can improve a specific endpoint’s response time by 300% and, if that endpoint gets little traffic, this metric won’t change at all. On the other hand, you can add a lot of new calls to a fast endpoint and the average response time will improve even though you made no real impact for users.
So average response time is a good start but it’s not enough.
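To make the effect concrete, here is a minimal sketch of a throughput-weighted average. The endpoint names and numbers are invented for illustration and are not real monday.com data.

```python
from typing import Dict, Tuple

# Hypothetical data: endpoint name -> (average response time in ms, requests per minute).
Endpoints = Dict[str, Tuple[float, float]]


def overall_average_ms(endpoints: Endpoints) -> float:
    """Average response time across all endpoints, weighted by throughput."""
    total_requests = sum(rpm for _, rpm in endpoints.values())
    if total_requests == 0:
        raise ValueError("no traffic recorded")
    total_time_ms = sum(avg_ms * rpm for avg_ms, rpm in endpoints.values())
    return total_time_ms / total_requests


baseline: Endpoints = {
    "load_board": (150.0, 9_000.0),
    "export_report": (4_000.0, 10.0),
    "health_check": (5.0, 1_000.0),
}

# Scenario 1: make the slow but rarely called endpoint four times faster.
faster_export = dict(baseline, export_report=(1_000.0, 10.0))

# Scenario 2: change nothing except sending ten times more calls to the fast endpoint.
more_health_checks = dict(baseline, health_check=(5.0, 10_000.0))

print(round(overall_average_ms(baseline), 1))            # 139.4
print(round(overall_average_ms(faster_export), 1))       # 136.4
print(round(overall_average_ms(more_health_checks), 1))  # 75.7
```

In this made-up data set the large improvement to the export endpoint moves the overall number by about 3ms, while the extra health checks almost halve it without any user noticing a difference.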
Databases need special attention
In a lot of applications the main bottleneck is the database. You can usually add more application servers easily but scaling the database is harder. Each type of database requires its own expertise and KPIs.
Once you find endpoints or processes that increase the load on your database it is worth improving them even if the direct customer impact is small.
Some insignificant feature running in an offline process can be the reason your entire app is slow if they are using the same database.
Not all flows were created equal
So we improved our average response time and are monitoring our databases. What does this say about our user experience? Well, not much. We need to dig deeper to correlate performance and user experience. One thing we tried to do is define the main flows in the system and improve them.
Even if a certain endpoint does not cause request queueing or database load, it might still require advanced tuning if it’s an important part of your app.
For example, our most important endpoint is the one that loads a board’s data. We added custom alerts, dashboards and multiple cache levels to this endpoint in order to make sure all our customers get a fast experience.
Care about all your users, not just the average use case
We are now monitoring all major endpoints and are seeing a significant improvement in the average response time for each of them. Still there are more unknowns. When tuning specific endpoints it’s important to consider percentiles. You can measure the average response time, the median response time, the 90th percentile and 99th percentile.
If a feature is usually fast but doesn’t work at all for 1% of your users then there’s clearly a problem.
When improving an endpoint the first thing we focus on is the average response time as it affects most users. We also look at the 99th percentile, as this is still a large chunk of our traffic and should be performant. Lastly we look at the slowest responses we have. Sometimes it’s hard to fix them but they can expose bugs and edge cases.
The first example is an endpoint that behaves as expected: the 99th percentile is around 400ms while the average is around 150ms.
But in the next examples we can clearly see a problem. Some requests are far too slow and the variance in response time is too big, pointing at bad limits on this endpoint.
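If your monitoring tool does not show these numbers directly, they are easy to compute from raw response times. The following is a small sketch using the nearest-rank percentile definition. The sample data is synthetic, and the names `percentile`, `summarize`, `healthy` and `degraded` are only illustrative.

```python
import math
from statistics import mean, median
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """The numbers worth looking at for a single endpoint."""
    return {
        "average": mean(samples_ms),
        "median": median(samples_ms),
        "p90": percentile(samples_ms, 90),
        "p99": percentile(samples_ms, 99),
        "slowest": max(samples_ms),
    }


# Synthetic response times in milliseconds, 100 requests each.
healthy = [120.0] * 90 + [400.0] * 9 + [600.0]
degraded = [120.0] * 95 + [400.0] * 3 + [30_000.0] * 2

print(summarize(healthy))
print(summarize(degraded))
```

In the synthetic `degraded` sample the median and 90th percentile are identical to the healthy one, so only the 99th percentile and the slowest response reveal that a few requests take 30 seconds.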
Request queueing is a great proxy for your app’s health
So far we have ignored how performance correlates with the app’s traffic. Maybe the response time is great most of the day, but at peak traffic your app collapses.
When our app can’t handle our scale we start to see request queueing, meaning that there are no free processes to handle user requests. This slows our app, and if the queue grows too much we start losing requests and returning error responses to our users.
Request queueing can happen for a lot of reasons. Sometimes it comes from throughput spikes (which can be prevented with throttling), and sometimes from a lack of attention to natural growth, with our scale outgrowing the current system limits.
Monitoring request queueing highlights problems that won’t be obvious from average response times or percentiles, and it correlates better with request throughput.
It’s important to be able to predict (even if only roughly) what scale your system can handle at any given moment and make sure it’s resilient enough that request spikes do not hurt the main functionality for most of your users.
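One rough way to make such a prediction is Little's law: the average number of requests being handled at once equals the arrival rate multiplied by the average time each request takes. This is a back-of-the-envelope sketch and not a description of how we do capacity planning. The worker count, traffic numbers and function names are assumptions made up for the example.

```python
def busy_workers(requests_per_second: float, avg_response_ms: float) -> float:
    """Little's law: average number of requests in flight at the same time."""
    return requests_per_second * avg_response_ms / 1000.0


def utilization(requests_per_second: float, avg_response_ms: float, workers: int) -> float:
    """Fraction of worker processes that are busy on average (above 1.0 means overload)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return busy_workers(requests_per_second, avg_response_ms) / workers


def max_throughput(avg_response_ms: float, workers: int) -> float:
    """Requests per second at which every worker would be busy all the time."""
    if avg_response_ms <= 0:
        raise ValueError("avg_response_ms must be positive")
    return workers * 1000.0 / avg_response_ms


# Hypothetical system: 200 worker processes, 150ms average response time.
WORKERS = 200
AVG_MS = 150.0

print(round(max_throughput(AVG_MS, WORKERS)))  # 1333 requests per second
print(utilization(1_000, AVG_MS, WORKERS))     # 0.75
print(utilization(1_500, AVG_MS, WORKERS))     # 1.125, requests will queue
```

The estimate ignores variance in both traffic and response times, so in practice queueing starts well before utilization reaches 1.0. Treat the result as an upper bound and leave headroom below it.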
Performance improvements are amazing, communicate what you do
Everyone agrees that performance improvements are important, but they can also be inspiring. Good communication inside the company can boost your R&D team’s technical abilities.
We use monday.com to write “Captain logs”, internal updates about new features and improvements. When we do performance improvements we communicate them in a similar manner.
Linking back to the KPIs that we defined, we show how we improved them and explain the customer impact. This helps show the importance of these tasks and trains the R&D team on what to pay attention to.
You know you’ve succeeded if people want to ask you performance questions on new features and have ideas to improve existing features.
Performance tuning is crucial for any company, especially one in rapid growth. Knowing what you want to measure, measuring it, prioritizing and communicating are just as important as the technical skills to actually solve the problem.
I hope that this post helped in that respect. We plan to add some more technical follow up posts detailing the performance improvements that we did.
Want to join the fun? We’re hiring
The server foundations team’s goal is to help the company scale up while keeping it easy and fun for developers to add new features. Our extreme growth forces us to reinvent ourselves constantly, building new microservices, introducing new technologies and constantly improving performance.
If you love scale challenges and want to take part in taking an already high traffic, high performance app to the next level, then come join us. You can see all our positions on our website, so let’s be in touch 🙂