How, and why, we scaled up to a Multi-DNS architecture (Part 3)

This is the third and final part of a three-part series, and deals with the actual migration process to Multi-DNS. To read from the beginning, go to part 1.

Daniel Mittelman

This is where we start to talk about the solution, and the key aspects needed to plan for a successful migration, both out of an existing provider and into multiple providers.

There are five aspects to a successful migration which we’ll discuss here:

  1. Syncing your team and restricting manual DNS updates
  2. Using Infrastructure-as-Code tools for zone management
  3. Creating your own recursive DNS server for testing
  4. Creating a playbook for your migration
  5. Dry-running the playbook in a test environment

1. Syncing your team and restricting manual DNS updates

This part may sound trivial, but make sure your team (and generally everyone who makes DNS changes to your domain) is aware of your migration. Once you export your zone in preparation for migrating to the new provider, make sure that no one makes manual changes to your existing DNS zone; otherwise, your exported copy will no longer mirror the live DNS configuration.

That thing you used to do? Don’t.

As a safeguard, instruct your co-workers to transfer any DNS change request to you. Then, if possible, restrict their permissions on your provider’s website and API to make sure changes are not made without notifying you. If you have any automatic processes that perform DNS record updates as part of their execution, make sure they’re doing so for your new provider as well.

2. Using Infrastructure-as-Code tools for zone management

For many DevOps teams, DNS management is usually accomplished through a web UI. As with managing your infrastructure manually, this means that changes to DNS may not be documented or audited, they do not go through the same “code review” process that code goes through before approval, and they may lead to an inconsistent state because they are not managed from a single point.

Updating DNS through the provider’s UI. Easy but not sustainable

While the popular Infra-as-Code tool Terraform does support a variety of DNS providers, we’ve decided to use OctoDNS for that purpose.

OctoDNS is a relatively new open-source tool, maintained by GitHub, that is built on a simple premise: DNS state is provided from a single source and is then applied to one or more destinations. As of today, it supports 16 managed providers, as well as reading from standardized zone files.

The source can be a YAML file, a DNS provider or a BIND zone file. The target(s) can be either of the first two. This flexible model allows teams to use OctoDNS to apply DNS changes to multiple providers by managing records in a local, git-managed file, or as a synchronizer between two DNS providers that do not support AXFR transfers.

Two relevant tools that OctoDNS provides are octodns-dump and octodns-sync. As one might expect, the first is used to dump an existing zone from a managed provider to a YAML file, and the second is used to apply changes from a source to the targets.

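To dump your existing zone records to a local file, we’ll create a config-cf.yml file that defines two providers: Cloudflare and a local YAML directory. The dump command further down names Cloudflare as the source to read from, while the zones section designates the local YAML file as the source and Cloudflare as the target for later syncs: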

providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./domains-cf
    default_ttl: 1800
  cloudflare:
    class: octodns.provider.cloudflare.CloudflareProvider
    email: env/CLOUDFLARE_EMAIL
    token: env/CLOUDFLARE_GLOBAL_API_KEY

zones:
  monday.com.:
    sources:
      - config
    targets:
      - cloudflare
Use environment variables, preferably backed by an encrypted local store, to save your provider’s credentials

Then run the following command:


$ octodns-dump \
--config-file config-cf.yml \
--output-dir domains-cf \
monday.com. cloudflare

The output of this command will be a new file located at domains-cf/monday.com.yaml. The file comprises a map of subdomains and their corresponding records, for example:

# Zone apex
? ''
: ttl: 1
  type: ALIAS
  value: abcdefghijklmn.cloudfront.net.
  octodns:
    cloudflare:
      proxied: true

# Subdomain with A and MX records
email:
  - type: A
    ttl: 3600
    value: 50.24.168.33
  - type: MX
    ttl: 7200
    values:
    - exchange: aspmx.l.google.com.
      preference: 1
    - exchange: alt1.aspmx.l.google.com.
      preference: 5
OctoDNS’s config file groups records under subdomains, making it easier to navigate through

OctoDNS has a generic record structure, which is made of the TTL, record type and value(s) for the record. In addition, provider-specific configuration is available (in this example, whether the record is reverse-proxied by Cloudflare’s edge network or not).

To load your zone into a new provider (or multiple providers), create a new configuration file called config-dns.yml which lists your providers and designates your local zone file as the source:

providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./domains-dns
    default_ttl: 1800

  ns1:
    class: octodns.provider.ns1.Ns1Provider
    api_key: env/NS1_API_KEY

  constellix:
    class: octodns.provider.constellix.ConstellixProvider
    api_key: env/CONSTELLIX_API_KEY
    secret_key: env/CONSTELLIX_SECRET_KEY

zones:
  monday.com.:
    sources:
      - config
    targets:
      - ns1
      - constellix

We used NS1 and Constellix, but you can use any supported provider

We then create a new directory called domains-dns, copy our monday.com.yaml file into it and make any necessary changes, for example removing provider-specific keys.
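Removing provider-specific keys by hand gets tedious on a large zone. As a minimal sketch, assuming the dumped YAML has already been loaded into a plain Python dict (for example with a YAML library of your choice), the helper below drops the octodns key from every record. The function name strip_provider_keys and the sample data are illustrative only and are not part of OctoDNS.

```python
from copy import deepcopy

# Key under which the dumped zone file stores provider-specific options.
PROVIDER_KEY = "octodns"


def strip_provider_keys(zone):
    """Return a copy of the zone mapping with provider-specific options removed."""
    cleaned = {}
    for name, records in zone.items():
        # A name maps either to a single record (dict) or to a list of records.
        if isinstance(records, dict):
            cleaned[name] = {
                k: deepcopy(v) for k, v in records.items() if k != PROVIDER_KEY
            }
        else:
            cleaned[name] = [
                {k: deepcopy(v) for k, v in record.items() if k != PROVIDER_KEY}
                for record in records
            ]
    return cleaned


# Sample data shaped like the dumped zone shown earlier (already parsed).
zone = {
    "": {
        "ttl": 1,
        "type": "ALIAS",
        "value": "abcdefghijklmn.cloudfront.net.",
        "octodns": {"cloudflare": {"proxied": True}},
    },
    "email": [
        {"type": "A", "ttl": 3600, "value": "50.24.168.33"},
    ],
}

cleaned = strip_provider_keys(zone)
print(cleaned[""])
```

The original mapping is left untouched, so the Cloudflare-specific copy can still be kept in domains-cf while the cleaned copy goes to domains-dns.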

All that’s left is to run the following command to see the diff between the local file and the remote providers (similar to terraform plan):

octodns-sync \
--config-file config-dns.yml \
monday.com.

And then run it again with the --doit flag to apply the changes:

octodns-sync \
--config-file config-dns.yml \
monday.com. \
--doit

From that point on, any DNS changes should be made to the local YAML zone file only, and then applied to both providers through a single command!
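To make the "single source, multiple targets" idea concrete, here is a minimal sketch of how a plan could be computed: the desired records are compared against what each provider currently holds, and the differences become creates, updates and deletes. This is not OctoDNS's actual implementation; the plan_changes function, the record layout and the sample records (including the www and old entries) are assumptions made up for illustration.

```python
def plan_changes(desired, existing):
    """Compare two {(name, type): record} mappings and list the required changes."""
    creates = sorted(k for k in desired if k not in existing)
    deletes = sorted(k for k in existing if k not in desired)
    updates = sorted(
        k for k in desired if k in existing and desired[k] != existing[k]
    )
    return {"create": creates, "update": updates, "delete": deletes}


# Desired state, as it would be read from the local zone file.
desired = {
    ("email", "A"): {"ttl": 3600, "values": ["50.24.168.33"]},
    ("www", "CNAME"): {"ttl": 300, "values": ["monday.com."]},
}

# Hypothetical current state of each target provider.
providers = {
    "ns1": {
        ("email", "A"): {"ttl": 3600, "values": ["50.24.168.33"]},
    },
    "constellix": {
        ("email", "A"): {"ttl": 300, "values": ["50.24.168.33"]},
        ("old", "A"): {"ttl": 300, "values": ["192.0.2.10"]},
    },
}

# One plan per target, all derived from the same source of truth.
plans = {name: plan_changes(desired, current) for name, current in providers.items()}
for name, plan in plans.items():
    print(name, plan)
```

Because every plan is derived from the same desired state, applying all of them leaves both providers holding identical records, which is the property the migration relies on.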

While it would be possible to set your existing provider as the source and new provider as the destination, we recommend going through the local YAML step in order to allow managing that zone with OctoDNS in the future, and cement it as the source of truth for the zone from now on.

But why two configuration files (config-cf.yml and config-dns.yml)?

We still want to use Cloudflare, just not as a DNS provider. Pro and Enterprise customers can switch Cloudflare to a mode called CNAME Setup (not covered in this post), which allows you to configure just the records that need reverse-proxying in Cloudflare and have your DNS point to them using CNAME records. This essentially creates two zone configurations we need to manage, but with OctoDNS that is easy and straightforward.

3. Create your own recursive DNS server for testing

A popular way to test your new DNS configuration, before switching to the new provider, is using commands like dig to query your new DNS server.

For example, to verify that our root domain’s A record is correct:




dig @8.8.8.8 A monday.com
dig @dns1.p01.nsone.net A monday.com



Where 8.8.8.8 is one of Google’s Public DNS servers and will respond with the current record value, and dns1.p01.nsone.net is the new authoritative DNS server we’re experimenting with. All we have to do is make sure the answers are equal after loading the zone into the new provider.
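Comparing answers by eye does not scale past a handful of records. The sketch below shows one way to automate the comparison: answers are treated as equal when both servers return the same set of values, regardless of order. The helper names (dig_short, answers_match) are hypothetical; dig_short shells out to dig with its +short option and therefore needs dig installed and network access, so the runnable part of the example uses literal sample answers instead.

```python
import subprocess


def dig_short(server, record_type, name):
    """Query one server with `dig +short` and return the answer lines."""
    result = subprocess.run(
        ["dig", "+short", f"@{server}", record_type, name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def answers_match(current, candidate):
    """True when both answer lists hold the same values, ignoring order."""
    def normalize(answers):
        return sorted(a.strip() for a in answers if a.strip())

    return normalize(current) == normalize(candidate)


# Against live servers this would look like:
#   answers_match(dig_short("8.8.8.8", "A", "monday.com"),
#                 dig_short("dns1.p01.nsone.net", "A", "monday.com"))

# Offline illustration with made-up answers (documentation address range):
same = answers_match(["192.0.2.1", "192.0.2.2"], ["192.0.2.2", "192.0.2.1"])
different = answers_match(["192.0.2.1"], ["192.0.2.9"])
print(same, different)
```

Running the comparison over every name and record type in the local zone file gives a quick pass/fail report before the switch.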

This is all good in theory. However, when your zone is complex and your applications depend on it internally, testing becomes harder. It’s also worth mentioning at this point that transferring DNS out of Cloudflare requires switching off all reverse proxy capabilities during the migration, so capabilities such as HTTP/2, HSTS and HTTP → HTTPS rewriting will not work unless your backend web server also supports them.

Consider the following scenario: you have a multi-AZ, multi-region deployment. Two of your microservices, each deployed in another region, exchange data over the public Internet, and DNS resolution takes place using public resolvers. How do you verify that nothing breaks inside your distributed environment?

To simulate a scenario where we already switched to our new DNS provider, we’ll create our own DNS server using the open-source Bind9 project. Our custom DNS server would reply to any query the same way any DNS server would (including recursive resolution), with the exception of our domain, in which case queries would be routed to the new provider.

To create and use a dedicated recursive DNS server for testing

Note: this guide assumes you’re using an Amazon Linux EC2 instance, however it should work on any Linux distribution, and any cloud provider, with minor changes.

We begin by launching an Amazon Linux EC2 instance (t3.micro should be sufficient) and configuring it to receive inbound DNS traffic and, of course, SSH.

Remember to limit inbound access to your IPs only, if possible

 

SSH into the machine and install some required packages:


$ sudo yum update
$ sudo yum install bind bind-utils

Next, we’ll edit the /etc/named.conf file.

In the options stanza, change/add the following configuration:




listen-on port 53 { any; };
allow-query { 0.0.0.0/0; };
forwarders { 8.8.8.8; };
recursion yes;



  • The listen-on directive binds the listener to all interfaces
  • The allow-query directive allows incoming requests from anywhere
  • The forwarders directive defines the upstream DNS servers of our nameserver
  • The recursion directive enables recursive resolution, so the server resolves any name on behalf of its clients, like a “regular” DNS server

Security note: DNS servers should never be left open to the world without rate limiting, since they may be abused for DNS amplification DDoS attacks. Since we’ve configured our instance’s Security Group to allow access from our IPs only, this is fine. If you cannot limit access by IP, add the following directive:




rate-limit { responses-per-second 15; };



We then add the following to the bottom of the file:




include "/etc/named/mondaycom.zone";



Then create the file referenced in the include directive and paste the following zone definition into it:




zone "monday.com" {
type forward;
forwarders { 198.51.44.7; };
};



This zone definition overrides the configured upstream (in our case 8.8.8.8) whenever a record in our zone is queried, so that queries for our domain are sent to one of the nameservers of our new DNS provider (dns1.p01.nsone.net = 198.51.44.7). Replace that value with the IP address of your provider’s nameserver.

Save the file, then apply your changes by restarting the named service.

We have our very own recursive DNS resolver for testing, which simulates a DNS server after we’ve completed the migration! Go ahead and set that DNS server as your primary (and only) DNS server on your workstation (Windows, Mac) and on your testing environment.

Commands like dig, nslookup and host will now resolve through our brand new DNS server. Your computer should keep working as usual, resolving other domains as if you had configured 8.8.8.8 as your DNS server, except for your domain.

4. Create a playbook for your migration

As with any major upgrade or migration to a production system, it’s imperative to have a detailed playbook that lists the steps of the migration, checkpoints and fallback plan.

In a DNS migration plan it’s critical to remember several things:

  1. Your first major checkpoint should be getting all your DNS providers (old and new) to return the exact same answers for all queries. In the case of Cloudflare, this involves turning off the reverse proxy (“orange cloud”) for all records, as it modifies the actual response.
  2. The TTL of NS records is almost always 48 hours, meaning that DNS servers around the world may take up to 48 hours to clear their internal cache and pick up your new DNS servers. To ensure consistency, plan for a propagation time of 48 hours and do not make any changes to records during that timeframe.
  3. Monitor the progress of your propagation using tools like dnschecker.org and whatsmydns.net. You will notice that most servers will resolve to your new provider within 1–2 hours, but do not rush: wait until you see 100% coverage.
  4. If your DNS provider allows you to assign your own TTL for NS records, add a step to significantly decrease that TTL before the migration, and wait for the original TTL to elapse before beginning the migration. Remember that the lower the TTL is, the faster you can recover from a problem. When you’re done, set the NS TTL back to its original value.
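The waiting periods in item 4 are worth writing into the playbook as concrete timestamps. The following sketch works them out under a simplifying assumption: resolvers honor TTLs, so a cached NS record lives at most one TTL. The function name migration_timeline and the example values (a 48-hour original TTL lowered to one hour) are illustrative, not prescribed.

```python
from datetime import datetime, timedelta


def migration_timeline(ttl_lowered_at, original_ttl_s, lowered_ttl_s):
    """Return (earliest safe start, worst-case propagation end) for an NS switch.

    Caches may hold the old NS records, with the old TTL, for up to
    original_ttl_s seconds after the TTL was lowered. Once the nameservers
    are switched, caches keep the previous NS set for at most lowered_ttl_s.
    """
    earliest_start = ttl_lowered_at + timedelta(seconds=original_ttl_s)
    propagation_end = earliest_start + timedelta(seconds=lowered_ttl_s)
    return earliest_start, propagation_end


lowered_at = datetime(2024, 1, 1, 0, 0)
start, end = migration_timeline(lowered_at, original_ttl_s=172800, lowered_ttl_s=3600)
print("Earliest migration start:", start)
print("Worst-case propagation end:", end)
```

The same arithmetic gives the rollback window: with the lowered TTL in place, reverting the nameservers takes effect for all well-behaved resolvers within one lowered TTL.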

5. Dry-run your playbook on a test environment

If you have a test/staging environment that is configured the exact same way as your production environment, run your playbook in full on that environment first.

It’s important to monitor your logs and look for new errors. It’s equally important to monitor the amount of traffic going into the environment, to verify that there are no DNS resolution errors that cause your environment to become unreachable.
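One simple way to watch inbound traffic during the dry run is to compare each interval's request count with a known baseline and flag anything that falls well below it. The sketch below assumes you can export per-minute request counts from your monitoring system; the function name find_traffic_drops, the 50% threshold and the sample numbers are arbitrary choices for illustration.

```python
def find_traffic_drops(requests_per_minute, baseline, min_ratio=0.5):
    """Return the indexes of the minutes whose count fell below min_ratio * baseline."""
    threshold = baseline * min_ratio
    return [i for i, count in enumerate(requests_per_minute) if count < threshold]


# Made-up per-minute request counts around a DNS switch.
samples = [980, 1010, 995, 420, 130, 990]
drops = find_traffic_drops(samples, baseline=1000)
print(drops)
```

A sustained run of flagged minutes right after a playbook step is a strong hint that some clients can no longer resolve the environment's hostnames.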

Thanks to Dan Ofir.