Project Torus — multi region availability kickoff

David Virtser

We are growing and our customers are evolving. Over the last year we received more requests from enterprise customers to have their data available in the region closest to them. These requests can stem from local regulations or from specific business requirements.

This year we decided it's time to focus on multi-region availability, because we care about our customers and want to give them the best possible experience of monday.com in terms of latency and high availability.

How do you approach such a project?

What are the solution requirements? Where do we start? How should we do it? At the first stage we had more questions than answers, as it's a huge project which affects literally everyone in the company. We first need to answer these questions to better scope the project.

Business aspects

We must understand what our clients need in order to better define the project requirements. Are they looking for their data to be persisted in their local region? Or do they care more about latency to our service and not about where their data resides? Which region (in Europe or Australia, for example) should be opened first? That could be a strategic decision as well.

To answer these questions and define our requirements, we want to analyze our historical data on failed deals and better understand what the customer requests were.

Product aspects

After we know what the basic requirements are, we need to understand how we change our product to support multiple regions.

Possible options:

  1. An account is created in region X and works against it. All of its data resides in that region and all communication stays within that region. Implication: users who travel might suffer from increased latency to monday.com services.
  2. Users are routed to the closest region by their geo location (IP). Implication: The data persistence requirement will not be met.
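The difference between the two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual routing logic: the region list, the coordinates, the `Account` type and both `route_*` functions are hypothetical, and the user's location is assumed to be already resolved from their IP.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

# Hypothetical region catalogue: region name -> approximate (latitude, longitude).
# Which regions would actually be opened is still an open question.
REGIONS = {
    "us-east-1": (38.9, -77.0),
    "eu-central-1": (50.1, 8.7),
    "ap-southeast-2": (-33.9, 151.2),
}


@dataclass(frozen=True)
class Account:
    account_id: int
    home_region: str  # region chosen when the account was created


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def route_pinned(account, user_location):
    """Option 1: always use the account's home region, wherever the user is."""
    return account.home_region


def route_nearest(account, user_location):
    """Option 2: use the region geographically closest to the user."""
    return min(REGIONS, key=lambda name: haversine_km(REGIONS[name], user_location))


# A user of a Europe-based account who is travelling in Sydney.
account = Account(account_id=1, home_region="eu-central-1")
sydney = (-33.87, 151.21)
print(route_pinned(account, sydney))   # stays in the home region, higher latency
print(route_nearest(account, sydney))  # lower latency, but leaves the home region
```

With option 1 the travelling user is still served from the account's home region, so the data stays put at the cost of latency. With option 2 the same user is served from the nearest region, which is exactly why the data persistence requirement would not be met.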

Do we provide multi region support for all types of accounts or is it relevant for enterprise accounts only?

What if an account wants to migrate from region X to Y? Do we need to support that use case from day one?

How do we onboard a new account: to a fixed region that we decide on, or to the region closest to the account's location?

To answer these questions we decided to meet with other companies similar to us in size and product offering, to understand how they did it and possibly learn from their mistakes.

Engineering aspects

After we gather all the basic business requirements and decide how to change our product to support a multi-region offering, we need to break down all the engineering aspects.

The questions we ask ourselves are:

  • How does user authentication and authorization need to be changed to support another region?
  • How do we do our data sharding? We are using MySQL, Elasticsearch and Redis databases. We have customer files saved in S3 buckets.
  • How does it change our DNS records management and how does our internal network need to change?
  • What infra automation do we need to build now to support such deployments, and how do we operate it in the future so that it makes sense in terms of time management?
  • As we run on AWS cloud, is there any technology that can help us with multi region deployment?
  • Which services are relevant and which are not for multi region availability?
  • We need to review a list of all 3rd party SaaS services that we are using (like email providers for example) to understand if they have support for multi region availability.
  • How does it change our monitoring and logging infrastructure to support observability?
  • What about Security, Disaster Recovery and much much more…

You need to be very focused here and define a clear list of Goals and Non-goals, so you can take some efforts off the table and put them aside (at least for phase I) to minimize the scope of this huge project.

Our plan

Research (Q1 2020)

We decided to invest the whole Q1 in research to be able to better define the scope of this project and get answers to all our questions.

Below is a list of posts and talks we found useful for our research:

https://read.acloud.guru/why-and-how-do-we-build-a-multi-region-active-active-architecture-6d81acb7d208
https://medium.com/@roshanpaiva/moving-to-a-multi-region-active-active-architecture-on-aws-2055e4408240
https://www.youtube.com/watch?v=2e29I3dA8o4&feature=youtu.be
https://www.youtube.com/watch?v=RMrfzR4zyM4&feature=youtu.be
https://aws.amazon.com/blogs/apn/architecting-multi-region-saas-solutions-on-aws/
https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b
https://netflixtechblog.com/global-cloud-active-active-and-beyond-a0fdfa2c3a45
https://www.atlassian.com/blog/technology/aws-scaling-multi-region-low-latency-service

Implementation (Q2-Q4 2020)

This phase is going to be defined after we finish with research…

Want to join us?

Do you find this challenge fascinating? Have you done it before?
If you’re a team player, this opportunity may be just for you! Apply here