monday.com’s Multi-Regional Architecture: A Deep Dive

Building a global SaaS platform requires lots of preparation, deep evaluation of your request routes and a truckload of R&D cooperation. Here's how we did it

Daniel Mittelman

Aug 10, 2021 · 13 min read

Preface

The monday.com platform was first launched about a decade ago, with a focus on building a great product that people can use and collaborate on for all of their day-to-day work needs. As demand grew and ever-larger organizations began adopting the platform, we started getting requests that are more characteristic of enterprises: advanced security features, deeper and more diverse integrations with external systems, localization and more.

From day one, monday.com was deployed in a single geographical region: the US east coast. With more teams joining every day from all over the world, the demand to have the platform running in other regions in the world grew stronger, most prominently from our EU users. Some have shown interest in having the platform run “closer to home”, while others asserted that having their data stored in the US is an actual adoption barrier.

As we looked into running our platform in multiple regions, we identified three key motivations for doing so:

Performance – running servers closer to the customer results in lower latencies, better connection quality (especially on mobile) and generally a better experience for the end-user.
Resilience – running the system in multiple regions concurrently, when any region can serve the entire userbase at any given time, can increase the system’s overall fault-tolerance and reduce downtime.
Privacy and compliance – allowing customers to choose where in the world their data is stored creates a clear competitive advantage and opens the door to working with strictly-regulated industries like healthcare and banking, and with companies that process PII and sensitive information.

When deciding to go multi-region, one needs to determine the primary motivation, as the work will vary greatly between performance-first, resilience-first and privacy-first designs. For example, can user data be stored in all regions, or only in one? How real-time is the system? What is the desired uptime SLA?

Here at monday.com, we’ve gone with a privacy-first design, since our market research showed that this is what our customers need the most. Sure, single-digit-millisecond network latency is a bonus, but following the major changes in privacy laws and data control over the last decade, privacy is what people want the most.

Design overview

Let’s begin by looking at the high-level overview of monday.com’s network from a single-region point of view:

Step by step, a request sent to monday.com goes through the following network path:

A user sends a request to a monday.com subdomain, which is proxied through Cloudflare for security purposes
Cloudflare sends the request to an AWS Network Load Balancer
The LB forwards the request to a random node running the Ambassador Edge Stack, an Envoy-based L7 API Gateway
Ambassador passes the request through two filters – a Web Application Firewall and an Authentication Service
If both filters have approved the request, Ambassador routes it to the appropriate backend based on request matching patterns, e.g. host, URI prefix, header value and more

Having an authentication service at the gateway, instead of delegating that responsibility to each backend individually, is an important component in any distributed web architecture. In our case, it also plays an additional role by identifying the user’s data region and allowing us to fix any routing issues at the gateway level. We’ll get to that later on.

The challenges of adding a second region

At this point, we wanted to begin expanding our network to support multiple data regions, with the first additional region planned for deployment in Frankfurt, Germany. As a reminder, monday.com’s multi-regional architecture is based on the privacy-first design, so each region serves different customers and private user data is never shared between regions.

To build the multi-region design we had to solve a few challenges along the way:

Challenge #1 – subdomains are slugs, but not entirely

Subdomains under the monday.com domain are usually account-specific slugs. monday.monday.com is monday.com’s own monday.com account (very meta, I know) and all the user actions taken under that account are translated to HTTP requests to that host, while other accounts use their own subdomain.

For example,

POST https://monday.monday.com/boards/12345678

might result in a change done to board 12345678 under the monday account.

So, that sounds like an easy problem to solve – use DNS to route US customers to the US, and EU customers to the EU!

However, there are shared subdomains that are used by all customers, like api.monday.com which serves all the API calls to the platform, regardless of your account or region. So, how would a DNS response look in that case?

Another thing to remember is that monday.com has more than 100K paying organizations, so mapping each account to its own DNS record would quickly become expensive and hard to scale. In a system serving a few hundred customers, DNS might be the preferred solution, but it will not work at a larger scale.

So, DNS can play a role here but is not the answer.

Challenge #2 – sometimes you don’t even know who the user is

Another interesting case is the login flow: an anonymous visitor arrives at our platform, types in their email and password, and is eventually routed to their account. To log in, all users go to the same subdomain, https://auth.monday.com.

In this case, which region processes the login flow? Remember, we don’t even know who the user is until their identity has been established and an authentication token has been issued.

Another case is webhook calls originating from customer integrations or our own third-party vendors. How would the third-party service know which region to call? True, in some integrations you can map a configuration to a region-specific URL, but in most of them that option does not exist and you will need to figure something out on your own.

Challenge #3 – staying as vendor-neutral as possible

There are a lot of ways to implement smart traffic routing using managed services. You could use Cloudflare Workers to analyze each request, determine its region and override its route target; AWS CloudFront with Lambda@Edge or Fastly’s Compute@Edge can also be used to achieve the same effect.

However, choosing any of these solutions means that you’re chaining your architecture to a single network service provider. If any of them experiences a major outage, you cannot recover without rebuilding your entire infrastructure in a way that allows multiple regions to co-exist for hundreds of thousands of accounts.

Challenge #4 – reproducibility

We wanted to build the new region in a way that can be replicated easily, so that the next region we deploy (if and when we choose to do so) will take significantly less time and effort to accomplish.

monday.com’s multi-regional design

After covering the challenges and the motivation, it’ll be easier to explain how we designed our multi-regional deployment:

We’ll dive deeper into how the communication takes place and how routing is implemented in the next few paragraphs, but let’s highlight a few key properties of this design:

Each region is independent – each region is equipped with the core infrastructure required to allow it to function independently and without effect from downtime of other regions. This is important for maintaining a higher uptime baseline and reducing the impact of a regional outage on all other users
The user can directly access each of the regions – this means that in obvious cases, like addressing the user’s own subdomain, we can optimize the network path and send the user directly to their data region if we want
Authentication is global – authentication metadata is replicated in real-time across all regions, so requests with an authentication token can enter the network from any region and be verified at the edge
Vendor-neutrality – Cloudflare does not play any key role in regional routing (it does play a role which we’ll touch on later). If Cloudflare is down, we can simply “switch them off” and the platform will remain online for all users in all regions!

We saw the single-region design earlier, so let’s look at how a more ambiguous request is handled:

Our network, in this case, received an HTTP request to api.monday.com with an attached authentication token in the Authorization header.

The request is sent by the European user to Cloudflare
We don’t know who the user is at this point, so Cloudflare sends the request to the US by default
The request is forwarded to Ambassador
Ambassador sends the request headers to be processed by the two filters. Authentication Service, as part of its job, determines who the user is and in which region their account is hosted. The service “tags” the request as belonging to the EU and returns a response to Ambassador
Ambassador detects that the destination region is not the current region; based on the destination region, it proxies (hands over) the request to Ambassador in the correct region
Ambassador in the EU region verifies the request again, and sees that it is now in the correct region
The request is finally forwarded to the corresponding backend
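
The handover decision in the steps above can be sketched in a few lines of Python. This is an illustration only, not the actual gateway code: the `ACCOUNT_REGIONS` table, the token values and the function names are hypothetical stand-ins, and in the real system this logic lives in the Authentication Service and in Ambassador's route configuration. The `x-region-target` header is the region tag discussed in the configuration section of this article.

```python
from typing import Dict, Optional

# Hypothetical token -> region table standing in for the replicated
# authentication metadata; real tokens are verified, not looked up like this.
ACCOUNT_REGIONS: Dict[str, str] = {"token-us-123": "us", "token-eu-456": "eu"}


def tag_region(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Mimic the auth filter: reject unknown tokens, otherwise tag the region."""
    token = headers.get("authorization", "")
    region = ACCOUNT_REGIONS.get(token)
    if region is None:
        return None  # authentication failed, so the gateway rejects the request
    tagged = dict(headers)
    tagged["x-region-target"] = region
    return tagged


def gateway_decision(current_region: str, headers: Dict[str, str]) -> str:
    """Return what a gateway running in `current_region` does with the request."""
    tagged = tag_region(headers)
    if tagged is None:
        return "reject"
    target = tagged["x-region-target"]
    if target != current_region:
        # Wrong region: hand the request over to the gateway in the target region.
        return f"handover:{target}"
    # Already in the right region: continue with regular backend routing.
    return "route-to-backend"
```

With these assumptions, a request carrying the EU token is handed over when it lands in the US (`gateway_decision("us", ...)` gives `"handover:eu"`), and the same request evaluated again at the EU gateway is routed to a backend.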

Let’s zoom into the network access layer:

Traffic always enters the network from one of the front-facing load balancers (there’s one in each region) and goes into Ambassador. Whenever traffic is tagged as belonging to another region, Ambassador sends the traffic to a special route called a regional handover route. That type of route translates to the address of an internal Network Load Balancer that is accessible over a private peered connection, allowing each Ambassador cluster to proxy requests to clusters in other regions.

Configuring Ambassador to support multi-region

Let’s dive one step deeper and look at the configuration that ties together our multi-regional network, Authentication service and API structure.

Ambassador uses Kubernetes objects called Mappings, which are simplified bindings between Envoy virtual host routes and clusters (meaning “backends”). Ambassador also exposes a property called precedence, which lets the engineer explicitly give a routing rule priority over other rules.

Our route mapping hierarchy is built like so:

Every Ambassador node evaluates requests based on these rule groups in that order. If a request is determined to be away from its target region, Ambassador will proxy it to the Ambassador node in the correct region and stop evaluating all the other rules; otherwise, regular prefix-based or host-based routing will take place.
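
To make the evaluation order concrete, here is a simplified Python model of precedence-based matching. It is a sketch under stated assumptions rather than Ambassador's actual algorithm: the `Rule` class, the `pick_rule` function and the `boards` rule are hypothetical, and the tie-break on prefix length is a simplification. Only the idea that a higher precedence value is evaluated first, and that the handover rule matches on a region header, comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rule:
    name: str
    precedence: int = 0  # rules without an explicit precedence default to 0 here
    prefix: str = "/"
    headers: Dict[str, str] = field(default_factory=dict)

    def matches(self, path: str, request_headers: Dict[str, str]) -> bool:
        """A rule matches when the path prefix and every required header match."""
        if not path.startswith(self.prefix):
            return False
        return all(request_headers.get(k) == v for k, v in self.headers.items())


def pick_rule(rules: List[Rule], path: str,
              request_headers: Dict[str, str]) -> Optional[Rule]:
    """Evaluate rules from highest to lowest precedence; the first match wins."""
    # Simplification: ties are broken by the longer (more specific) prefix.
    ordered = sorted(rules, key=lambda r: (r.precedence, len(r.prefix)), reverse=True)
    for rule in ordered:
        if rule.matches(path, request_headers):
            return rule
    return None


rules = [
    Rule("boards", prefix="/boards/"),  # hypothetical regular prefix-based route
    Rule("route-to-frankfurt", precedence=100, headers={"x-region-target": "eu"}),
]
```

In this model, a request to `/boards/1` tagged `x-region-target: eu` is captured by the handover rule before the `boards` rule is ever considered, while the same path tagged `us` falls through to regular prefix-based routing.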

Here’s an example of a regional handover rule that is configured in our US region, used to route requests to the EU:

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: route-to-frankfurt
spec:
  precedence: 100
  add_response_headers:
    x-routed-from: us
    x-routed-to: eu
  connect_timeout_ms: 6000
  cluster_idle_timeout_ms: 300000
  idle_timeout_ms: 300000
  timeout_ms: 60000
  headers:
    x-region-target: eu
  prefix: /
  service: https://frankfurt.internal.nlb.address

Let’s review some of the important things here:

The rule is applied whenever a request carries the x-region-target request header with a value that differs from the current region. This is how the authentication service “tags” the region on the request. In the US, for example, we’ll have only one rule – match x-region-target: eu
We use a high precedence value to prioritize this rule over other rules
We inject two response headers: x-routed-from and x-routed-to. This allows us to know when a request has been rerouted to another region
We override cluster_idle_timeout_ms and idle_timeout_ms and set them to 300 seconds. AWS NLBs enforce a 350-second idle timeout on connections and then terminate them silently (without sending an RST packet), causing an issue where a connection that remains unused for a long time will still be considered active by Ambassador, even though it was closed. Changing these parameters fixes that issue, which otherwise would result in intermittent HTTP 503 responses from the gateway
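
The timeout reasoning above boils down to a simple invariant: both idle timeouts must expire before the load balancer's does. A small check like the following could guard against regressions in the Mapping spec. This is a hypothetical helper, not part of Ambassador or our tooling; the 350-second figure is the NLB idle timeout quoted above, and the spec is assumed to be available as a plain dictionary.

```python
NLB_IDLE_TIMEOUT_S = 350  # NLB idle timeout quoted in the text above


def idle_timeouts_are_safe(mapping_spec: dict,
                           lb_timeout_s: int = NLB_IDLE_TIMEOUT_S) -> bool:
    """True when both idle timeouts are set and expire strictly before the LB's."""
    keys = ("cluster_idle_timeout_ms", "idle_timeout_ms")
    return all(
        key in mapping_spec and mapping_spec[key] / 1000 < lb_timeout_s
        for key in keys
    )


# The values from the handover rule shown earlier: 300 s < 350 s, so the gateway
# closes idle connections before the load balancer silently drops them.
handover_spec = {"cluster_idle_timeout_ms": 300000, "idle_timeout_ms": 300000}
```

A spec that omits either key, or sets one at or above 350 seconds, fails the check.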

Application deployment in a multi-regional world

Once we have multiple regions online, we need to decide how we deploy the different ecosystem components that used to run in a single region, so that they’re available in all regions.

We support three deployment topologies:

Regional applications are deployed fully into every region as an independent component. The US deployment of a microservice operates separately from the EU deployment: they do not share any data and they’re not even aware of the existence of one another.
Global applications are deployed once in the US, and are accessible over the internal network from all regions. These usually serve for non-critical, metadata-only uses that the system can temporarily operate without.
Replicated applications are deployed fully into every region, but share a single replicated database so their state is identical in all regions. These serve for latency-sensitive critical applications, like the Authentication service, but do not process private user content.

As a general rule, we encourage our developers to opt for the regional deployment topology whenever possible, even though the cost is higher. In cases where private user content is not stored, or the service is not critical to the operation of the entire platform, we allow developers to choose other topologies according to their specific needs.
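
One way to read these guidelines is as a small decision helper. The sketch below is an interpretation of the rules above, not an official policy: the `Topology` enum and `allowed_topologies` function are hypothetical, and the mapping from service properties to topologies is an assumption based on the three descriptions.

```python
from enum import Enum
from typing import List


class Topology(Enum):
    REGIONAL = "regional"      # independent deployment per region
    REPLICATED = "replicated"  # deployed everywhere, sharing a replicated database
    GLOBAL = "global"          # deployed once, reachable from all regions


def allowed_topologies(handles_private_content: bool, critical: bool) -> List[Topology]:
    """List the topologies a service may use, preferred option first."""
    # Regional is always allowed and is the encouraged default.
    options = [Topology.REGIONAL]
    if handles_private_content:
        # Assumption: private user content never leaves its data region.
        return options
    options.append(Topology.REPLICATED)
    if not critical:
        # Assumption: only services the platform can temporarily run without
        # may depend on a single global deployment.
        options.append(Topology.GLOBAL)
    return options
```

Under this reading, a service that handles private user content can only be regional, a critical metadata service such as authentication may also be replicated, and a non-critical metadata service may use any of the three.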

Logging and monitoring

We have several tools that we use to monitor our global deployment:

Envoy access logs: We stream all the access logs from all regions to a centralized logging system. Since Ambassador is built on Envoy, we customized the Envoy access log format to include additional information like the authenticated user ID, regional information and Cloudflare’s Ray ID. These allow us, for example, to aggregate logs on a per-region basis and identify user activity at the gateway access log level.
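
As a rough illustration of per-region aggregation, the following Python sketch counts requests per region from access log lines. It assumes the logs are emitted as one JSON object per line, and the field names (`region`, `user_id`, `cf_ray`) are hypothetical; the actual log format and field names are not shown in this article.

```python
import json
from collections import Counter
from typing import Iterable


def requests_per_region(log_lines: Iterable[str]) -> Counter:
    """Count access log entries per region, skipping blank lines."""
    counts: Counter = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Entries without a region field are grouped under "unknown".
        counts[entry.get("region", "unknown")] += 1
    return counts


# Hypothetical sample lines in the assumed JSON format.
sample_lines = [
    '{"region": "us", "user_id": 1, "cf_ray": "abc"}',
    '{"region": "eu", "user_id": 2, "cf_ray": "def"}',
    '{"region": "eu", "user_id": 3, "cf_ray": "ghi"}',
    '',
]
```

In practice this kind of aggregation would run inside the centralized logging system rather than in a script, but the grouping key is the same.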

Prometheus/Grafana: We collect the Envoy metrics from Ambassador and store them in Prometheus. That data, combined with access log data, allowed us to build a dashboard that displays real-time and historic operational insights about Ambassador across our entire infrastructure, including traffic volume, ingress throughput (in Mbps/Gbps), authentication success/reject ratio, HTTP status code distribution, average compression ratio and more.

AWS CloudWatch: Since our entire infrastructure is built on top of the AWS cloud, we use Transit Gateway, VPC and NLB metrics to keep track of the internal network activity, identify how much of the data goes through regional handover and ensure it is kept to a minimum. We also keep VPC Flow Logs for all VPCs to analyze specific paths that might require further diagnostics down the road.

Summary

Expanding your deployment to multiple regions has a myriad of benefits, from increased control over where user content is stored to better fault tolerance to lower latency for the end-user. The most important thing when setting out to add more regions is to decide why you want to go this way, and who the main beneficiary is. At monday.com, we went for the privacy-first approach, and built each region to operate independently and host the content of different users.

After almost a year of work, designing and redesigning, our multi-regional architecture is up and running and fully operational, and this approach has proven itself in the face of the high throughput and scale of the monday.com platform.