How, and why, we scaled up to a Multi-DNS architecture (Part 1)
This is the first of a three-part series, laying out the background and reasoning behind the transition. Part 2 covers the technical background of Multi-DNS and its strategies, and Part 3 deep-dives into actually making the transition.
Don’t get me wrong: Cloudflare is probably one of the best, if not the best, DNS providers in the world. DNS comparison sites like dnsperf.com feature Cloudflare as a managed DNS provider with very low worldwide latency and excellent uptime, often ranking it in first place even against other large, well-known providers!
As long-time Cloudflare customers, leveraging several of their services to protect our platform (primarily their reverse proxy services, including DDoS mitigation and content-based filtering), we were happy to use Cloudflare as our DNS provider as well. Then, on July 2, 2019, something changed.
Tuesday, 3 PM local time. This is roughly when incoming traffic to monday.com climbs toward its daily peak. In fact, here’s what an average day looks like if you’re a monday.com application load balancer:
A few minutes go by and our phones start ringing like crazy. “monday.com is down”, the alarm proclaims, and we scramble to our computers to check what just happened. And indeed, instead of the application, we see the dreaded HTTP 502 Bad Gateway screen.
Reviewing our general deployment revealed that everything was working on our side. Servers? Up. Load balancers? Doing their jobs. Firewalls? Lambda functions? All up and running.
Huh.
The next logical theory was that something was wrong at the perimeter, which in our case means Cloudflare. A quick peek at their status page revealed that they were “…observing network performance issues…” and that “Customers may be experiencing 502 errors while accessing sites…”
This is not a new phenomenon; it happens from time to time. We know that when it does, in extreme cases where automatic rerouting fails, we can simply go into Cloudflare and turn off its proxy capabilities so that monday.com and its subdomains resolve directly to our load balancers and gateways. Not optimal, but better than a self-inflicted denial of service, right?
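For the curious, disabling the proxy ("orange cloud") on a record boils down to a single PATCH call against the Cloudflare v4 API. Here is a minimal sketch; the zone and record IDs are placeholders, and the request is only sent when an API token is present in the environment:

```python
import json
import os
import urllib.request

# Hypothetical identifiers -- replace with your actual zone/record IDs.
ZONE_ID = "your_zone_id"
RECORD_ID = "your_record_id"

def build_unproxy_request(zone_id, record_id, token):
    """Build a Cloudflare API request that turns off proxying for a
    DNS record, so the name resolves directly to its origin instead
    of going through Cloudflare's edge."""
    url = (f"https://api.cloudflare.com/client/v4/zones/"
           f"{zone_id}/dns_records/{record_id}")
    body = json.dumps({"proxied": False}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# Only actually call the API if a token is configured:
if os.environ.get("CF_API_TOKEN"):
    req = build_unproxy_request(ZONE_ID, RECORD_ID,
                                os.environ["CF_API_TOKEN"])
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```

Of course, this only works while the API itself is reachable, which is exactly the assumption that broke down for us.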
But, as hinted in the premise, this time was different: the Cloudflare portal, along with their API, was down as well. We (alongside tens of thousands of other customers) were completely locked out of the traffic management system and could not reroute our users’ traffic to our systems!
With no other viable option, we were forced to wait out the storm and have our platform suffer a 30-minute downtime during rush hour, which by our standards is completely unacceptable.
In the wake of this incident, we decided it was imperative to introduce a separation of concerns between our reverse proxies and the DNS that points to them. It was time to move DNS to separate infrastructure, while ensuring its resilience against a multitude of issues and attacks.
To improve our DNS management capabilities, and make them independent of other network layers, one must accept a few axioms:
First and foremost: DNS and reverse proxies must be managed by different service providers, and be hosted on different physical infrastructure. This is where the separation of concerns principle manifests itself, and is relevant even if you’re hosting your own infrastructure on-premise.
The second is a well-known truth among DNS server operators: DNS networks are constantly under attack. These attacks range from DNS cache poisoning attempts, through amplification attacks that abuse unpatched servers to target another machine, all the way to DDoS attacks against the DNS provider itself.
Equipped with these basic invariants, we set out to plan a migration of our authoritative DNS servers from Cloudflare to another managed provider, and increase redundancy by adding a second DNS provider.
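To make the end state concrete, a dual-provider setup is ultimately just a delegation with NS records from two independent providers at the registrar. A sketch, with purely illustrative provider names:

```
; Example NS set at the registrar, delegating the zone to two
; independent managed DNS providers (names here are illustrative):
monday.com.  172800  IN  NS  ns1.provider-a.example.
monday.com.  172800  IN  NS  ns2.provider-a.example.
monday.com.  172800  IN  NS  ns1.provider-b.example.
monday.com.  172800  IN  NS  ns2.provider-b.example.
```

With this in place, resolvers spread queries across both providers, so an outage at either one leaves the zone resolvable. Keeping both providers' zone data in sync is the hard part, which is exactly what the next parts of this series cover.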
Continue to Part 2 for some key technical background, or skip ahead to Part 3, where we talk about how we made the transition with confidence, and the crucial lessons we learned along the way.