Keeping Your Data Fresh: Optimizing Live Updates with Websockets
In this article, we’ll look at a common problem for monday.com users: keeping board views up to date. When changes happen while a tab is inactive, users come back to outdated information. We’ll walk through how we measured the problem, what causes it, and the solution we built to make the platform more reliable.
Wait, What’s the Problem?
Many of our users run into what we call “stale” boards: the board data on the client falls out of sync with the server, so the user sees outdated information. This happens when a user has a board open in a tab and changes are made while that tab is inactive. When the user returns to the tab, it still shows the old version of the board, without the changes made in the meantime. It is especially common when users close their laptops for the night and later return to a tab with an open board.
Why Does This Happen?
At monday.com, we use Pusher, a WebSocket-based communication layer, to deliver real-time data updates to our users. When user A changes a board that user B is currently viewing, Pusher sends a corresponding event to user B so that their view receives the update. However, when Pusher events fail to reach the user’s device, the updates are never applied and the board view goes stale. Because Pusher events are retained only briefly (typically a few minutes), the tab has no way to “replay” the events it missed. Until now, the only recourse for users was to refresh the tab manually, which is inconvenient and far from ideal.
As Sensitive as It Gets
Boards are central entities in monday.com, and any change to how they load needs careful consideration because it can affect many parts of the system. The board loading flow runs many times during a user’s session and millions of times a day overall, and retrieving a board’s data is one of the most resource-intensive operations in our system. In some cases even logging is a challenge because of the sheer volume of events, so we rely on sampled logging. For these reasons, we are cautious with any change in this area and use controlled release mechanisms.
Detecting a Stale Board
Before solving the problem, we wanted to measure its scale. That would let us track the severity of the issue and, later, analyze the impact of our solution. Our initial approach to detecting stale boards followed this flow:
- At the start of a new user session, we set a timeout for a random duration between 1 and 48 hours.
- When the timeout triggers, we gather the “comparable” data from the board, focusing on visible and easily serializable columns like numbers, text, and people.
- This data is then sent to the server, where it is compared with the latest information in our databases. The server then returns the result to the client.
- If the result shows that the data is out of sync, the client reports detailed tracking events.
However, we soon realized that this approach was insufficient. If the timeout expired while the tab was inactive, the callback would run immediately when the tab was reactivated. At that moment, no fix would have had a chance to run yet, so we would always measure a stale board right after a period of inactivity, regardless of the solution we employed. To address this, we introduced a “backoff” mechanism: if the timeout fired later than expected, which indicates that the tab had been inactive, we would skip the check, create a new timeout, and restart the process.
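The scheduling and backoff logic described above can be pictured with a minimal sketch. It is not our production code: the function names, the injectable clock and scheduler, and the 5-second lateness tolerance are assumptions made for illustration. Only the 1 to 48 hour window and the "restart if the timer fired late" rule come from the flow described above.

```typescript
// Sketch only: names and the tolerance value are illustrative assumptions.
const HOUR_MS = 60 * 60 * 1000;
const LATE_TOLERANCE_MS = 5000; // assumed slack before a timer counts as "late"

type Schedule = (callback: () => void, delayMs: number) => void;

/** Random delay between 1 and 48 hours; `random` is injectable for tests. */
function pickProbeDelayMs(random: () => number = Math.random): number {
  return HOUR_MS + random() * 47 * HOUR_MS;
}

/** True when the timer fired noticeably later than requested (tab was likely inactive). */
function firedLate(scheduledAt: number, delayMs: number, firedAt: number): boolean {
  return firedAt - scheduledAt > delayMs + LATE_TOLERANCE_MS;
}

function scheduleStalenessProbe(
  runProbe: () => void,
  now: () => number = Date.now,
  schedule: Schedule = (callback, delayMs) => { setTimeout(callback, delayMs); },
  random: () => number = Math.random
): void {
  const delayMs = pickProbeDelayMs(random);
  const scheduledAt = now();
  schedule(() => {
    if (firedLate(scheduledAt, delayMs, now())) {
      // Backoff: the tab was probably inactive, so skip this measurement and start over.
      scheduleStalenessProbe(runProbe, now, schedule, random);
      return;
    }
    runProbe(); // hypothetical: gather comparable data and ask the server to compare
  }, delayMs);
}
```

Passing the clock and scheduler as parameters keeps the sketch testable without waiting for real timers; in a browser the defaults (`Date.now` and `setTimeout`) would be used.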
After analyzing the collected data for a few days, we found that approximately 18% of user sessions had boards that went out of sync.
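For the comparison step in the detection flow, the idea is simply to check each value the client sent against the server's current value. The sketch below is a hypothetical illustration: the snapshot shape (item ID to column ID to value) and the function name are assumptions, not the actual format we use. It only checks cells the client reported, so items that exist only on the server are not detected.

```typescript
// Sketch only: the snapshot shape and function name are illustrative assumptions.
type ComparableValue = string | number | string[]; // text, numbers, people (as IDs)
type ColumnValues = Record<string, ComparableValue>; // columnId -> value
type BoardSnapshot = Record<string, ColumnValues>; // itemId -> columns

/** Returns "itemId:columnId" keys for every client cell that differs from the server's value. */
function findStaleCells(clientSnapshot: BoardSnapshot, serverSnapshot: BoardSnapshot): string[] {
  const staleCells: string[] = [];
  for (const itemId of Object.keys(clientSnapshot)) {
    const clientColumns: ColumnValues = clientSnapshot[itemId] || {};
    const serverColumns: ColumnValues = serverSnapshot[itemId] || {};
    for (const columnId of Object.keys(clientColumns)) {
      // JSON.stringify gives a simple structural comparison for primitives and arrays.
      // A cell missing on the server serializes to undefined and counts as a mismatch.
      if (JSON.stringify(clientColumns[columnId]) !== JSON.stringify(serverColumns[columnId])) {
        staleCells.push(`${itemId}:${columnId}`);
      }
    }
  }
  return staleCells;
}
```

A non-empty result would mark the session as stale. Array values such as people are compared in order here; a real comparison might need to normalize ordering first.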
Triggers for Reloading the Data
After a lot of research, we decided that each of the following events should (potentially) cause a reload for the board’s data:
1. Network status changed: We use the browser’s built-in “online” event to detect when the browser switches from “offline” to “online” mode. This event indicates that updates may have been missed, and we had already been handling it for a while.
2. Frozen tab detected: Many browsers freeze tabs that are not in use to save memory and CPU. When a tab is frozen, no JavaScript can run, which means that Pusher events are not processed and we don’t even know that we missed them. Since many of our users keep monday.com tabs open for weeks or even months, our tabs get frozen frequently.
We found a method to identify frozen tabs: we set a timeout of 10 seconds and check the actual elapsed time in the callback. If noticeably more than 10 seconds have passed (10 seconds plus a small epsilon), we conclude that the tab was frozen during that period.
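The timer-based check described above can be sketched roughly as follows. The 10-second interval comes from the description; the 2-second epsilon, the function names, and the injectable clock and scheduler are assumptions for illustration rather than our actual implementation.

```typescript
// Sketch only: the epsilon value and all names are illustrative assumptions.
const CHECK_INTERVAL_MS = 10000; // the 10-second timeout described above
const EPSILON_MS = 2000; // assumed tolerance for ordinary timer jitter

type Schedule = (callback: () => void, delayMs: number) => void;

/** True when far more time passed than the timer asked for. */
function wasFrozen(elapsedMs: number): boolean {
  return elapsedMs > CHECK_INTERVAL_MS + EPSILON_MS;
}

function watchForFrozenTab(
  onFrozenTabDetected: (elapsedMs: number) => void,
  now: () => number = Date.now, // wall-clock time, so time spent suspended counts
  schedule: Schedule = (callback, delayMs) => { setTimeout(callback, delayMs); }
): void {
  const armedAt = now();
  schedule(() => {
    const elapsedMs = now() - armedAt;
    if (wasFrozen(elapsedMs)) {
      onFrozenTabDetected(elapsedMs); // e.g. mark the board as needing a reload
    }
    watchForFrozenTab(onFrozenTabDetected, now, schedule); // re-arm for the next interval
  }, CHECK_INTERVAL_MS);
}
```

Browsers may also throttle timers in background tabs without fully freezing them, so the epsilon would likely need tuning to avoid treating ordinary throttling as a freeze.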
By combining these two triggers, we achieved broad coverage of potential scenarios that could result in tabs going out-of-sync. Interestingly, only 3% of the reload triggers were attributed to network issues, which were already addressed before our recent improvements. The majority of reload triggers were due to frozen tabs, an area that we had yet to handle properly.
Figure: Triggers for reloading data
Executing the Reload
At this stage, we were confident in our ability to determine when to reload the board’s data. However, our initial implementation, which reloaded the board without any optimizations, put significant strain on our servers: it caused over 1 million additional reloads per day while the feature was open to only a fraction of our users. We turned the feature flag off and set out to improve the solution.
To address the server load issue, we introduced two key requirements for reloading the board:
- Tab visibility for at least 3 seconds: Since many monday.com users keep multiple tabs open for extended periods, we had to prevent a scenario in which a sleeping machine with 50+ tabs wakes up and floods the server with heavy data requests while the user only uses a few of those tabs. With this safety mechanism, a tab is reloaded only once it becomes visible. As a result, more than 95% of reloads were delayed. Of these delayed reloads, approximately 55% were delayed for more than 10 minutes, and around 21% for more than 2 hours. This significantly reduced unnecessary data reloads.
Figure: Delays of waiting for active tab
- Server-side updates since the last sync: Our second requirement was that the server must have updates since the last synchronization. While this may seem straightforward, monday.com’s advanced permissions mechanism complicates it: missed updates might grant the user permission to access additional data. Therefore, we query the server to determine whether the current user has missed any updates. If there are none since the last synchronization, we don’t reload the data. This measure eliminated an additional 65% of redundant reloads.
Figure: Breakdown of data-fetch delays
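To make the two requirements concrete, here is a minimal sketch of how such a gate could be combined. The `TabState` shape, the `BoardApi` interface with its `hasUpdatesSince` and `reloadBoard` methods, and the outcome names are all hypothetical; only the 3-second visibility rule and the "ask the server before reloading" rule come from the description above.

```typescript
// Sketch only: the state shape, BoardApi, and outcome names are illustrative assumptions.
const MIN_VISIBLE_MS = 3000; // the 3-second visibility requirement described above

interface TabState {
  visibleSince: number | null; // when the tab last became visible, null while hidden
  lastSyncAt: number; // timestamp of the last successful board sync
}

interface BoardApi {
  hasUpdatesSince(lastSyncAt: number): Promise<boolean>; // hypothetical, permission-aware server check
  reloadBoard(): Promise<void>; // hypothetical heavy data fetch
}

type ReloadOutcome = "deferred" | "skipped" | "reloaded";

function visibleLongEnough(visibleSince: number | null, now: number): boolean {
  return visibleSince !== null && now - visibleSince >= MIN_VISIBLE_MS;
}

async function reloadIfNeeded(state: TabState, now: number, api: BoardApi): Promise<ReloadOutcome> {
  if (!visibleLongEnough(state.visibleSince, now)) {
    return "deferred"; // keep the reload pending until the tab has been visible long enough
  }
  if (!(await api.hasUpdatesSince(state.lastSyncAt))) {
    return "skipped"; // nothing changed for this user, so avoid the expensive fetch
  }
  await api.reloadBoard();
  return "reloaded";
}
```

In a browser, `visibleSince` could be maintained from the `visibilitychange` event and `document.visibilityState`, with `reloadIfNeeded` called about 3 seconds after the tab becomes visible. Resetting `visibleSince` to `null` whenever the tab is hidden means a quick flip through many tabs never triggers a fetch.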
After months of development and continuous iteration, we significantly reduced the occurrence of stale boards on monday.com, from 18% to 6.2%. This has noticeably improved the user experience.
The development journey was accompanied by further improvements that go beyond the scope of this post, demonstrating our ongoing commitment to enhancing monday.com. We are proud of the outcome, as it has made monday.com a more reliable and efficient tool for seamless team collaboration.