Intermittent outages on .org sites (e.g. ncsasports.org)
We’re tracking a problem that manifests as intermittent outages on .org domains. What appears to be happening is that the .org DNS servers sometimes return a null response instead of a referral to the authoritative servers. Our local DNS servers then cache that “null value”, and the site appears down until the cache entry expires and the full recursive lookup happens again.
Here’s an example of a failed recursive lookup:
sfrazer-mbp:~ sfrazer$ dig +trace www.prairie.org
; <<>> DiG 9.4.2-P2 <<>> +trace www.prairie.org
;; global options: printcmd
. 79601 IN NS l.root-servers.net.
. 79601 IN NS j.root-servers.net.
. 79601 IN NS c.root-servers.net.
. 79601 IN NS k.root-servers.net.
. 79601 IN NS i.root-servers.net.
. 79601 IN NS d.root-servers.net.
. 79601 IN NS b.root-servers.net.
. 79601 IN NS f.root-servers.net.
. 79601 IN NS a.root-servers.net.
. 79601 IN NS m.root-servers.net.
. 79601 IN NS e.root-servers.net.
. 79601 IN NS h.root-servers.net.
. 79601 IN NS g.root-servers.net.
;; Received 449 bytes from 192.168.0.21#53(192.168.0.21) in 11 ms
org. 172800 IN NS C0.ORG.AFILIAS-NST.INFO.
org. 172800 IN NS D0.ORG.AFILIAS-NST.org.
org. 172800 IN NS A0.ORG.AFILIAS-NST.INFO.
org. 172800 IN NS A2.ORG.AFILIAS-NST.INFO.
org. 172800 IN NS B0.ORG.AFILIAS-NST.org.
org. 172800 IN NS B2.ORG.AFILIAS-NST.org.
;; Received 435 bytes from 192.58.128.30#53(j.root-servers.net) in 31 ms
org. 0 IN SOA a0.org.afilias-nst.info. noc.afilias-nst.info. 2008502420 1800 900 604800 86400
;; Received 96 bytes from 199.19.56.1#53(A0.ORG.AFILIAS-NST.INFO) in 49 ms
sfrazer-mbp:~ sfrazer$
Instead of answering with only the SOA record for org, A0.ORG.AFILIAS-NST.INFO should have returned a list of our DNS servers (the NS records for prairie.org), which would then be queried.
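To spot this pattern without reading every trace by eye, the check can be sketched in Python. This is only a rough illustration, not a tool we actually run: the function name `find_failed_hops` and the `SAMPLE` text (a shortened copy of the trace above) are made up for this example, and it assumes the `dig +trace` output has one record per line. Note that a lookup for a name that really does not exist also ends in an SOA-only answer, so a hit here is a hint rather than proof.

```python
import re

def find_failed_hops(trace_text):
    """Return the servers whose +trace step contained only SOA records."""
    failed = []
    record_types = []  # record types seen since the previous ";; Received" line
    for line in trace_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(";; Received"):
            # The answering server's name is in parentheses after "#53".
            match = re.search(r"#\d+\(([^)]+)\)", line)
            server = match.group(1) if match else "unknown"
            if record_types and all(t == "SOA" for t in record_types):
                failed.append(server)
            record_types = []
        elif not line.startswith(";"):
            # Record lines look like: name TTL class type rdata...
            fields = line.split()
            if len(fields) >= 5 and fields[2] == "IN":
                record_types.append(fields[3])
    return failed

# Shortened copy of the failed trace shown above.
SAMPLE = """\
. 79601 IN NS a.root-servers.net.
;; Received 449 bytes from 192.168.0.21#53(192.168.0.21) in 11 ms
org. 172800 IN NS A0.ORG.AFILIAS-NST.INFO.
;; Received 435 bytes from 192.58.128.30#53(j.root-servers.net) in 31 ms
org. 0 IN SOA a0.org.afilias-nst.info. noc.afilias-nst.info. 2008502420 1800 900 604800 86400
;; Received 96 bytes from 199.19.56.1#53(A0.ORG.AFILIAS-NST.INFO) in 49 ms
"""

print(find_failed_hops(SAMPLE))
```

On the sample, the only step flagged is the one answered by A0.ORG.AFILIAS-NST.INFO; the root and referral steps pass because they contain NS records.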
In short, the issue is out of our control: our DNS servers remain healthy and are serving the correct content, and the websites themselves are still up, even though some people will be unable to get to them.
Because we set the Time To Live on our DNS zones to 5 minutes, the outages generally don’t last long (the cache expires quickly and is refilled), but full lookups happen more often, so people are more likely to see the problem. The alternative would be a longer TTL, which would reduce how often people see the problem but lengthen the time until it resolves itself.
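The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it assumes a resolver under steady traffic repeats the full lookup once per TTL, that each lookup fails independently with some probability, and that a failed answer stays cached for one full TTL. The function name `ttl_tradeoff` and the 1% failure rate are made up for illustration.

```python
def ttl_tradeoff(ttl_seconds, failure_probability, window_seconds=86400):
    """Estimate outage count, outage length, and downtime over a window.

    Assumes one full lookup per TTL and that a bad answer is cached
    for a whole TTL (both are simplifying assumptions).
    """
    lookups = window_seconds / ttl_seconds
    expected_outages = lookups * failure_probability
    outage_length = ttl_seconds  # a bad answer lives until the cache expires
    expected_downtime = expected_outages * outage_length
    return expected_outages, outage_length, expected_downtime

# Hypothetical 1% failure rate, comparing a 5 minute TTL with a 1 hour TTL.
for ttl in (300, 3600):
    outages, length, downtime = ttl_tradeoff(ttl, 0.01)
    print(f"TTL {ttl}s: {outages:.2f} outages/day, "
          f"{length}s each, {downtime:.0f}s total")
```

Under these assumptions the expected total downtime per day comes out the same for both TTLs. The short TTL produces more, shorter outages and the long TTL produces fewer, longer ones, which is the trade-off described above.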
Update: The problem has apparently been resolved. More information here.

