You Can’t Handle the DNS Truth

Problem to Solve – My company is experiencing slow application response times and DNS issues in our IPv4 environment. What could be causing this, and how do we troubleshoot and resolve the pain?

This topic is so easy and well documented that I paused before actually deciding to put fingers to keyboard. The tipping point for me was a cry for help from a customer of ours who was at wit’s end. The culprit once again was the infamous IPv6 query, or as we in the know like to call them, “AAAA” or “quad-A” queries. Just jump to your favorite search engine and enter any variation of “IPV6 queries slowing down my DNS resolution” and you will receive a voluminous list in return.

The cause behind this phenomenon comes down to one metric: time. When a Microsoft Windows client attempts to connect to a resource using a domain name, the client will try IPv6 before IPv4. “So what?” you might ask. Well, there are some rules to this interaction. If you look at the table below, which I took from https://support.microsoft.com/en-us/kb/2834226, you will notice a run time of 10 seconds for a Windows 7 client with two DNS servers, either statically configured or provided by DHCP.

The behavior is the following (tested on Windows 7 and Windows 8 clients with a single NIC):

Table:
Time (seconds since start) | Action
0  | Client queries the first DNS server of the list
1  | If no response is received after 1 second, client queries the second DNS server of the list
2  | If no response is received after 1 more second, client queries again the second DNS server of the list
4  | If no response is received after 2 more seconds, client queries all the servers in the list at the same time
8  | If no response is received after 4 more seconds, client queries all the servers in the list at the same time
10 | If no response is received after 2 more seconds, client stops querying
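
If it helps to see the table as a timeline, here is a minimal Python sketch that replays it for a client whose queries all go unanswered. It only restates the schedule from the table; the function name, the labels and the server addresses are mine, made up for illustration, and nothing here talks to a real DNS server.

```python
from typing import List, Tuple

# Schedule copied from the table: (seconds since start, which servers get queried).
# The labels "first", "second" and "all" exist only in this sketch.
SCHEDULE = [
    (0, "first"),
    (1, "second"),
    (2, "second"),
    (4, "all"),
    (8, "all"),
]
GIVE_UP_AT = 10  # seconds since start, when the client stops querying


def query_timeline(servers: List[str]) -> List[Tuple[int, List[str]]]:
    """Return (time, servers queried) pairs for a client that never gets an answer."""
    if len(servers) < 2:
        raise ValueError("the table describes a client with at least two DNS servers")
    targets = {"first": servers[:1], "second": servers[1:2], "all": list(servers)}
    return [(t, targets[who]) for t, who in SCHEDULE]


if __name__ == "__main__":
    dns_servers = ["10.0.0.1", "10.0.0.2"]  # hypothetical addresses
    for t, sent_to in query_timeline(dns_servers):
        print(f"t={t:>2}s  query -> {', '.join(sent_to)}")
    print(f"t={GIVE_UP_AT}s  client stops querying")
```

Swap in your own server list to see which servers are hit at each step before the client gives up.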

Ten seconds may seem bad enough all by itself. But since the IPv6 query series is processed first, that is 10 seconds before the IPv4 requests begin and potentially run through the same process again. Thankfully, at most companies I visit, the first IPv4 request and response are successful. So, the DNS resolution is successful after roughly 11 seconds. The truly amazing thing is how conditioned users can become. In investigating this issue with a subset of users distributed throughout the organization, we found that all of them chalked this issue up to “everything being slow first thing in the morning when everyone is getting on-line”. These resolutions will be cached on each client, so subsequent resolution requests will be fulfilled locally from the client device itself. Most likely this will last the rest of the day or until the device is disconnected from the network or turned off.
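
If you want to check whether a particular client is paying this penalty, one rough approach is to time an IPv4-only lookup against an IPv6-only lookup for the same name. The sketch below does that with Python's standard `socket.getaddrinfo`. Treat it as a starting point only: the function name and the example host name are my own, and because the call goes through the operating system resolver, a name that is already cached will come back quickly and hide the delay.

```python
import socket
import time
from typing import Callable, Dict

# Label -> address family. An AF_INET lookup asks for A records,
# an AF_INET6 lookup asks for AAAA records.
FAMILIES = {"IPv4 (A)": socket.AF_INET, "IPv6 (AAAA)": socket.AF_INET6}


def time_lookups(host: str, resolver: Callable = socket.getaddrinfo) -> Dict[str, float]:
    """Time one lookup per address family and return the seconds each one took."""
    timings = {}
    for label, family in FAMILIES.items():
        start = time.perf_counter()
        try:
            resolver(host, None, family=family)
        except OSError:
            pass  # a failed lookup still costs the user the time it took
        timings[label] = time.perf_counter() - start
    return timings


# Example use (hypothetical host name, run on the affected client):
#   for label, seconds in time_lookups("intranet.example.com").items():
#       print(f"{label:<12} {seconds:6.2f} s")
```

A large gap between the two numbers on a freshly booted client points at the AAAA series as the source of the wait.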

Thanks Microsoft …..

According to Microsoft Support, disabling IPv6 is neither recommended nor supported (https://technet.microsoft.com/en-us/network/cc987595.aspx#EBE). Yes, there are certainly registry “HACKS” that can disable IPv6. You can also attempt to alter the BINDING order in the IP stack. One easy mitigation I have often seen is to add a command line statement to a Group Policy logon script, “netsh int ipv6 6to4 set state disabled”, which turns off the 6to4 transition interface and is meant to stop the IPv6 queries. In the end, though, Microsoft says this will cause more problems down the road, but they don’t elaborate on what those “problems” might be. Thanks Microsoft.

So maybe it is just time to pack it in and say to yourself “That’s just how it is”? Who is really getting hurt? I should just disable these alarms in my monitoring tools or ignore them if I am still using raw packet trace file forensics and just get on with my day. There are days when I might agree with that decision. Before we do that, let’s at least run the numbers.

The DNS Truth 

The average DNS request is going to be around 80 bytes. The average reply will be about the same length. Things like QoS and VLAN markings can certainly affect that length. If you have 1000 people all logging in around the same time, and IPv6 failures are expected, there most likely will be 6 queries and one response per client for each resolution that is not locally cached. So, as users associate with domain controllers, authenticate, authorize, attach to network drives, SharePoint and email, and launch their business applications, each user could generate as much as 560 bytes (7 packets of 80 bytes) of network traffic per resource attachment and application launch. If the average user makes 10 DNS requests at the beginning of the day alone, that is 5.6 kilobytes per user, which works out to 5.6 megabytes, or 44.8 megabits, per 1000 users added to the overall network utilization. The time waiting for DNS to resolve is probably going to be 10 seconds per single request going through the chart above. Multiplied by 10 requests per user, that could add up to 100 seconds (a little under 2 minutes) added to the beginning of an average person’s work day, or 100,000 seconds (nearly 28 hours) of potentially wasted productivity for 1000 users. I know, this seems like something an actuary at an insurance company would be all upset about rather than something an IT department should worry itself with. Add in some WAN delay and you can start to see how this could be deemed a real problem worth fixing for some organizations. Like we don’t have enough work to do already.
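
To make it easy to rerun these numbers for your own shop, here is the same back-of-the-envelope arithmetic as a short Python sketch. The defaults are the figures used above (80-byte packets, 6 queries plus 1 response, 10 lookups per user, 1000 users, 10 seconds of waiting per lookup); the function and key names are mine, and every input is an estimate rather than a measurement.

```python
def morning_dns_cost(users=1000, lookups_per_user=10, packets_per_lookup=7,
                     packet_bytes=80, wait_per_lookup_s=10):
    """Estimate extra traffic and waiting time caused by failed AAAA lookups."""
    bytes_per_lookup = packets_per_lookup * packet_bytes   # 6 queries + 1 response
    bytes_per_user = bytes_per_lookup * lookups_per_user
    bytes_total = bytes_per_user * users
    wait_per_user_s = wait_per_lookup_s * lookups_per_user
    wait_total_s = wait_per_user_s * users
    return {
        "bytes_per_lookup": bytes_per_lookup,
        "bytes_per_user": bytes_per_user,
        "megabytes_total": bytes_total / 1_000_000,
        "megabits_total": bytes_total * 8 / 1_000_000,
        "wait_per_user_s": wait_per_user_s,
        "wait_total_s": wait_total_s,
        "wait_total_hours": wait_total_s / 3600,
    }


if __name__ == "__main__":
    for name, value in morning_dns_cost().items():
        print(f"{name:<18} {value:,.1f}")
```

Change `users` or `lookups_per_user` to match your environment and the totals scale accordingly.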

Thanks again … Microsoft

I wish there were a quick Microsoft-supported fix I could recommend for dealing with this problem. So many organizations just heap this situation on top of the political pie between server administration and network administration, with the network folks blaming the server folks and each holding the other responsible for fixing the root cause. Microsoft’s recommendation in this area is to set up your IPv6 network settings for your Domain and DNS authorities and for each and every NIC in these same servers. If this is done, Microsoft claims that client default IPv6 requests will be resolved with both an IPv4 address and an IPv6 address, and whichever GATEWAY {router} IP address, IPv6 or IPv4, leads to the resolved resource will automatically be used.

I could continue on and share several war stories, either passed along to me or experienced myself, related to this topic. Surely most of you who have invested the time to read this far could as well. I have seen DR sites that took hours to come online due to DNS issues affecting multi-tiered service applications connecting to each other, the old front-end/middleware/DB-backend combination (for a multi-tier viewpoint, take a look at http://problemsolverblog.czekaj.org/troubleshooting/dont-let-a-multi-tier-application-make-you-multi-tear-up/). I have also seen Load Balancers that have melted down over misconfigured DNS resolution on backend servers; that is one of my favorite stories.

In closing, none of us need more work to do, and we certainly would not be praised for “looking for trouble in all the wrong places”. But if there is a project on the horizon that would encompass DNS, like data center consolidation and/or virtualization, bringing a new data center online, or SDN, it would certainly be worth the effort and time to consider. Creating a plan that prepares a critical service like DNS for the inevitable IPv6 requirements and dependencies of your networking and OS vendors will help you in the end. This also might be a good opportunity to take a quick look into your existing DNS tool sets (from an APM / NPM perspective), and verify that you can “see” the behavior of your IPv4 and IPv6 DNS requests.
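
If your tools do not break DNS requests out by type, a small script can at least show what “seeing” them looks like. The Python sketch below reads the question section of raw DNS payloads and counts A versus AAAA requests. It assumes you have already pulled the DNS payloads (the bytes after the UDP header) out of a capture, which is not shown here. The function names are mine, and `build_query` exists only to produce sample data for the demonstration.

```python
import struct
from collections import Counter
from typing import Iterable

QTYPE_NAMES = {1: "A", 28: "AAAA"}  # record type codes for IPv4 and IPv6 addresses


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query payload, used here only as sample data."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(bytes([len(part)]) + part.encode("ascii")
                      for part in name.split("."))
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN


def question_type(payload: bytes) -> str:
    """Return the record type asked for in the first question of a DNS message."""
    pos = 12                      # the fixed DNS header is 12 bytes long
    while payload[pos] != 0:      # walk the length-prefixed labels of the name
        pos += payload[pos] + 1
    (qtype,) = struct.unpack_from("!H", payload, pos + 1)
    return QTYPE_NAMES.get(qtype, f"TYPE{qtype}")


def count_query_types(payloads: Iterable[bytes]) -> Counter:
    """Tally how many requests were seen for each record type."""
    return Counter(question_type(p) for p in payloads)


if __name__ == "__main__":
    sample = [
        build_query("fileserver.example.com", 28),  # hypothetical names
        build_query("fileserver.example.com", 28),
        build_query("fileserver.example.com", 1),
    ]
    print(count_query_types(sample))
```

A count heavily skewed toward AAAA requests with no matching responses is the pattern described in this post.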