Short DNS Record TTL And Centralization Are Serious Risks For The Internet
Yesterday Dyn, a DNS provider, went down after a massive DDoS attack. That left many popular websites inaccessible, including Twitter, LinkedIn, eBay and others. The internet seemed to be “crawling on its knees”.
We’ll probably read an interesting post-mortem from Dyn, but why did that happen? First, DDoS capacity is increasing, fueled by insecure and infected IoT devices with access to the internet. Huge volumes of fake requests are poured onto a given server or set of servers, which then become inaccessible – either because they can’t cope with the requests, or simply because the network path to them doesn’t have enough throughput to accommodate all the traffic.
But why did “the internet” stop because a single DNS provider was under attack? First, because of centralization. The internet is supposed to be decentralized (although I’ve argued that, exactly because of DNS, it is pseudo-decentralized). But services like Dyn, UltraDNS, Amazon Route53 and also Akamai and CloudFlare centralize DNS. I can’t tell exactly how much, but out of the top 500 websites according to Moz.com, 181 use one of the above 5 services as their DNS provider. Add 25 Google services that use their own, and you get nearly 200 out of 500 concentrated in just 6 entities.
But centralization of the authoritative nameservers alone would not have led to yesterday’s problem. A big part of the problem, I think, is the TTL (time to live) of the DNS records – the records which contain the mapping between domain name and IP address(es). The idea is that you should not always hit the authoritative nameserver (Dyn’s server(s) in this case) – you should hit it only if there is no cached entry anywhere along the way of your request. Your operating system may have a cache, but more importantly – your ISP has a cache. So when subscribers of one ISP all make requests to Twitter, the requests should not go to the nameserver, but should instead be resolved by looking them up in the cache of the ISP.
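To illustrate, here is a minimal, hypothetical sketch of what a caching resolver conceptually does (in Java, since the analysis code further down uses Java) – the fixed TTL, the upstream query and the returned IP are placeholders, not a real resolver implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a caching resolver honoring TTL: an answer is reused until it
// expires, and only then is the authoritative nameserver queried again.
public class TtlCache {

    private static class Entry {
        final String ip;
        final long expiresAtMillis;

        Entry(String ip, long ttlSeconds) {
            this.ip = ip;
            this.expiresAtMillis = System.currentTimeMillis() + ttlSeconds * 1000;
        }

        boolean expired() {
            return System.currentTimeMillis() > expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public String resolve(String domain) {
        Entry cached = cache.get(domain);
        if (cached != null && !cached.expired()) {
            return cached.ip; // served from cache – no load on the authoritative nameserver
        }
        // Cache miss (or expired entry): the authoritative nameserver gets hit.
        String ip = queryAuthoritative(domain);
        long ttlSeconds = 300; // with a 5-minute TTL this happens again very soon
        cache.put(domain, new Entry(ip, ttlSeconds));
        return ip;
    }

    // Placeholder for the real upstream query (e.g. via dnsjava).
    private String queryAuthoritative(String domain) {
        return "192.0.2.1"; // documentation-range IP, for illustration only
    }
}
```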
If that were the case, most users would be able to access all services regardless of whether Dyn was down, because they would have the IPs cached and resolved. And that’s the proper, distributed mode the internet should function in.
However, it has become common practice to set very short TTLs on DNS records – just a few minutes. So after those few minutes expire, your browser has to ask the nameserver “what IP should I connect to in order to access twitter.com?”. That’s why the attack was so successful – because nothing was cached and everyone repeatedly turned to Dyn to get the IP corresponding to the requested domain.
That practice is highly questionable, to say the least. This article explains the issues of short TTLs in detail, but let me quote some important bits:
The lower the TTL the more frequently the DNS is accessed. If not careful DNS reliability may become more important than the reliability of, say, the corporate web server.
The increasing use of very low TTLs (sub one minute) is extremely misguided if not fundamentally flawed. The most charitable explanation for the trend to lower TTL value may be to try and create a dynamic load-balancer or a fast fail-over strategy. More likely the effect will be to break the nameserver through increased load.
So we knew the risks, and it was inevitable that this problematic practice would be abused. I decided to analyze how big the problem actually is. So I took the aforementioned top 500 websites as representative, fetched their A, AAAA (IPv6), CNAME and NS records, and put them into a table. You can find the code in this gist (it uses the dnsjava library).
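I won’t reproduce the whole gist here, but the approach is roughly the following – a simplified sketch using dnsjava, with the domain list truncated and the CSV columns reduced:

```java
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.Type;

import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

// For each domain, fetch the A, AAAA, CNAME and NS records via dnsjava and
// dump domain, record type, TTL and value as a CSV row.
public class TtlSurvey {

    private static final int[] TYPES = {Type.A, Type.AAAA, Type.CNAME, Type.NS};

    public static void main(String[] args) throws Exception {
        // In the real analysis this is the Moz.com top-500 list.
        List<String> domains = Arrays.asList("twitter.com", "linkedin.com", "ebay.com");
        try (PrintWriter csv = new PrintWriter("dns-ttl.csv")) {
            csv.println("domain,type,ttl,value");
            for (String domain : domains) {
                for (int type : TYPES) {
                    Record[] records = new Lookup(domain, type).run();
                    if (records == null) {
                        continue; // no record of this type, or the lookup failed
                    }
                    for (Record record : records) {
                        csv.println(domain + "," + Type.string(type) + ","
                                + record.getTTL() + "," + record.rdataToString());
                    }
                }
            }
        }
    }
}
```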
The resulting CSV can be seen here. And if you want to play with it in Excel, here is the Excel file.
Some other things that I collected: how many websites have AAAA (IPv6) records (only 79 out of 500), whether the TTLs of the IPv4 and IPv6 records differ (they do for 4), the DNS provider (taken from the NS records, which is how I got the figures mentioned above), and how many use CNAME instead of A records (just a few). I also collected the number of A/AAAA records, in order to see how many (potentially) utilize round-robin DNS (187). Worth mentioning: the A records served to me may differ from those served to other users, which is also a way to do load balancing.
The results are a bit scary. The average TTL is only around 7600 seconds (2 hours and 6 minutes). But it gets worse when you look at the lower half (sort the values by TTL and take the lowest 250): the average there is just 215 seconds. This means the DNS servers are hit constantly, which turns them into a real single point of failure, and “the internet goes down” after just a few minutes of DDoS.
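For reference, the two figures are computed simply (the loadTtlsFromCsv helper below is hypothetical – it just stands for reading the per-site TTL column of the CSV):

```java
import java.util.Arrays;

// Overall average TTL, plus the average of the lower half after sorting.
public class TtlStats {

    public static void main(String[] args) {
        long[] ttls = loadTtlsFromCsv(); // hypothetical helper: the per-site TTLs
        Arrays.sort(ttls);

        double overallAverage = Arrays.stream(ttls).average().orElse(0);       // ~7600s for the top 500
        long[] lowerHalf = Arrays.copyOfRange(ttls, 0, ttls.length / 2);        // the 250 lowest values
        double lowerHalfAverage = Arrays.stream(lowerHalf).average().orElse(0); // ~215s

        System.out.printf("average TTL: %.0fs, lower-half average: %.0fs%n",
                overallAverage, lowerHalfAverage);
    }

    private static long[] loadTtlsFromCsv() {
        // Stand-in data for illustration only.
        return new long[] {60, 300, 300, 3600, 86400};
    }
}
```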
Just a few websites have a high TTL, as can be seen from this simple chart (all 500 sites are on the X axis, the TTL is on the Y axis):
What are the benefits of a short TTL? Not many, actually. You have the flexibility to change your IP address, but you don’t do that very often, and besides – it doesn’t automatically mean all users will be pointed to the new IP, as some ISPs, routers and operating systems may ignore the TTL value and keep the cache alive for longer periods. You could do round-robin DNS, which is basically using the DNS provider as a load balancer, and that sounds wrong in most cases. It can be used for geolocation routing – serving a different IP depending on the geographical area of the request – but that doesn’t necessarily require a low TTL: if caching happens closer to the user than to the authoritative DNS server, users will be pointed to the nearest IP anyway, regardless of whether that value gets refreshed often or not.
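For completeness, client-side round-robin over multiple A records is trivial – here is a hypothetical sketch with dnsjava (example.com is a placeholder domain):

```java
import org.xbill.DNS.ARecord;
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.Type;

import java.net.InetAddress;
import java.util.concurrent.ThreadLocalRandom;

// Round-robin DNS from the client's perspective: the nameserver returns several
// A records and the client (or resolver) picks one, spreading the load without
// a dedicated load balancer.
public class RoundRobinLookup {

    public static void main(String[] args) throws Exception {
        Record[] records = new Lookup("example.com", Type.A).run();
        if (records == null || records.length == 0) {
            System.out.println("no A records");
            return;
        }
        // Picking a random entry is the simplest client-side rotation.
        ARecord chosen = (ARecord) records[ThreadLocalRandom.current().nextInt(records.length)];
        InetAddress address = chosen.getAddress();
        System.out.println("connecting to " + address.getHostAddress());
    }
}
```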
Short TTLs are very useful for internal infrastructure – when pointing to internal components (e.g. a message queue, or a particular service if using microservices), a low TTL may be the better choice. But that’s not about your main domain being accessed from the internet.
Overlay networks like BitTorrent and Bitcoin use DNS round-robin for seeding new clients with a list of peers they can connect to (your first use of a torrent client connects you to one of several domains that each point to a number of nodes that are supposed to be always on). But that’s again a rare use case.
Overall, I think most services should go for higher TTLs. 24 hours is not too much, and you’ll need to keep your old IP serving requests for 24 hours anyway, because of caches that ignore the TTL value. That way services won’t care whether the authoritative nameserver is down or not. And that would in turn make DNS providers a less interesting target for attacks.
And I understand the flexibility that Dyn and Route53 give us. But maybe we should think of a more distributed way to gain that flexibility. Because yesterday’s attack may be just the beginning.
I’ve been waiting for talk of TTLs following this mess. Can anyone explain why so many massive companies use a single DNS provider with a 5-minute TTL? If you’re going to use DNS for load balancing (which seems like a horrible idea), why not use an alternative provider with a reasonable TTL to cover sustained primary outages?
Some companies use multiple NS across multiple TLDs. They were all fine. Twitter? All 4 were Dyn, on .net.
Am I missing something? Where are the DNS experts here to explain a use case for ultra-short TTLs?
DNS isn’t the problem – it seems like it’s being misused. Caching is a fundamental tenet of the internet, and DNS records are probably the most cacheable thing on the internet.
Cloud-based implementations require short TTLs as the ability to scale requires adding additional IPs into rotation. This is the case for Amazon ELBs, for example.
Additionally, when running content in multiple regions you want to be able to take a region out of service if there is a failure, again requiring a short TTL.
You might also be shifting traffic between regions as load changes, also requiring a short TTL.
Even locally you might want to be able to fail over between multiple F5-style load balancers, another short TTL.
It’s probably safer to assume that DDoS-infected boxes will ignore TTLs and pound a given target machine anyway. ISPs might want to consider blocking outbound port 53, and forcing such traffic to go through their own servers.
I don’t know a lot about what happened, but I manage a small bicycle shop in Colorado that uses a company called Cayan for payment processing. The DDoS attack on Dyn left us unable to process credit cards for much of the morning. It was an issue that we’ve never seen before. We were imprinting cards on carbon slips for the better part of 5 hours that day. Most of our customers (and employees, for that matter) have never even seen a knuckle dragger before.
Pretty surprising how something like this can affect people all over the nation.
One important benefit of a short TTL is when your DNS record is changed illegitimately. Once you change the IP back to the correct server, you want the correct IP to propagate as soon as possible. This is where a short TTL will help you.
“as some ISPs, routers and operating systems may ignore the TTL value and keep the cache alive for longer periods.”
I am not sure what percentage of ISPs, routers and OSes actually do that. But if it is significant, then it would act as a kind of shield against DDoS attacks. Still, ignoring the TTL is against standard practice, and it could have fallout in case the IP of a domain is changed and you still point to the old IP.
We should rather think of ways to secure IoT devices to contain future DDoS attacks involving them.
It would be best to standardize devices so that they hit the market only when they are protected against the most common misuses of DNS or any other service.
On the distribution of DNS TTLs: https://00f.net/2012/05/10/distribution-of-dns-ttls/
In response to Frank’s argument, it wasn’t that one website was being hammered, it was that the authoritative nameserver was getting hammered; TTLs weren’t an attack vector.
However, since DNS is propagated, higher TTLs may have mitigated some of the problems. I found it interesting that OpenDNS’s nameservers kept working, meaning that their implementation was either ignoring TTLs or perhaps not flushing answers until given an upstream response.
Either way, I’ve got a fundamental problem with nerfing anyone’s access to the internet under the guise of safety/security. As developers and admins, we can do better and we should do better. The answer isn’t to block outgoing UDP over port 53.
This DDoS should say to us that using a single DNS provider probably isn’t the best idea if high availability is a requirement. Attacking one DNS provider managed to break the internet: that shouldn’t happen today!
At a time when everything is in “the cloud” and (almost) everything is decentralized, it’s embarrassing to have a single DDoS attack take down numerous websites because the answers to their DNS queries are all centralized with a single provider.
I really like the idea of DNS resolvers not flushing entries / ignoring the TTL until they get a response from the nameserver. I wish someone would make an easy-to-install implementation of that for OpenWrt/DD-WRT. It would be nice to have your own private cached stash of the last few thousand sites you visited, so you can mostly still function without outside DNS for an extended period of time. Uverse seems to mess with DNS requests once they leave your local network – I’ve noticed that if I run nslookup against 8.8.8.8, I’m really still connected to Uverse DNS; it’s like they redirect all port-53 traffic back to them. You are basically dead if their DNS is down. I was told they do it to make sure the Uverse streams aren’t messed with.