Packets Lost in Space

Problem to Solve – My company is having application performance issues, and there is a strong suspicion that they are being caused by network packet loss.  What can we do?

One of the most exasperating activities for any networking person is defending the network against the charge that it is creating packet loss, which is closely followed by trying to isolate, or triage, the cause of that packet loss.

Diagnosing this common complaint is anything but intuitive.  Complaints of slowness in applications or “screen freezes” (one of my personal favorites) often send IT staff in all directions.  SNMP polling solutions are often the first stop in the triage process.  That usually involves looking for things like servers with high CPU utilization or low memory, overtaxed routers, or links that are saturated.  “Walking” through server logs looking for application level issues often closely follows the quick romp through SNMP Trap land.  On and on we run down the proverbial hall, opening door after door, with each door leading to another tool, another view, or a login to another networking device.

The Obvious….and Not So Obvious

Depending on what tools and expertise your organization has access to, the best approach to most problems is to eliminate what the problem is NOT.  If your “go to” monitoring products are all showing green status, you can often dismiss most of the obvious problems.

The next step then becomes looking for the “not so obvious”.  Just a few years ago, vendors were proud of themselves for automating the laborious task of calculating TCP response times for different applications (or, I should say, for well known application TCP port numbers).  That can be a good place to start.  Metrics (either produced on demand or continually monitored) that show TCP response times longer than expected or longer than previously observed can really help.  Ideally, these metrics should be compared against utilization trends on the network links and segments between the user(s) and the service(s) to greatly narrow your possible causes and further investigation choices.

This is the point where investigators and analysts will leave packet loss possibilities until later or even last.  Why? It is usually the hardest to track down.  But that doesn’t need to be the case.  You do not need to look at two “side by side trace files” matching packets between the two files.  Nor do you need to worry about having measurement equipment or tools at each end either.

TCP will show you the way

Let’s discuss some of the inner workings of TCP: how, when, and what TCP does in reaction to different network, server, and application issues.  Starting with something easy is often a good way to build your knowledge.  So, let’s examine VoIP.  Why do so many tools provide the ability to identify packet loss with instrumentation, tools, and investigation at just one end?  Because audio and video protocols have very predictable and well defined sizes and payloads, and each packet carries a SEQUENCE number.  In both directions of such a data stream, we know with certainty what the next expected SEQUENCE number is going to be.  When the difference between consecutive SEQUENCE numbers is greater than expected, we know exactly how much packet loss has occurred.  Not long ago, that process was done by hand by walking trace files.  But nearly all tools and instrumentation today can provide this streamlined workflow and logic for you.
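The sequence number arithmetic described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function name `count_lost_packets` is made up for this example, the packets are assumed to arrive in order, and the counter is assumed to wrap at 16 bits (as RTP sequence numbers do).  Reordered packets would need extra handling that is left out here.

```python
def count_lost_packets(seq_numbers, modulus=1 << 16):
    """Count packets missing from an in-order stream of sequence numbers.

    Assumption: the counter wraps at `modulus` (16 bits, as in RTP) and
    packets are observed in the order they were sent.
    """
    lost = 0
    for previous, current in zip(seq_numbers, seq_numbers[1:]):
        # Forward distance from the previous packet, allowing for wraparound.
        step = (current - previous) % modulus
        # A step of 1 means nothing is missing; a larger step is a gap.
        if step > 1:
            lost += step - 1
    return lost


# Sequence numbers seen at one end; 2 and 3 never arrived.
observed = [65533, 65534, 65535, 0, 1, 4, 5]
print(count_lost_packets(observed))  # 2
```

Only one observation point is needed, which is exactly why single ended tools can report loss for these protocols.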

In addition, since most of these audio and video solutions are implemented using QoS marking within your network equipment, most equipment vendors supply an SNMP OID (variable) that tracks the number of QoS related drops.  When that counter increases, it means that resources are exhausted and QoS policy is being used to decide which packets are going to be dropped.

An even lower tech method of identifying the possibility of packet loss (one that requires little effort) is to compare the packet counts between two interfaces in the data path of the traffic.  Even this can turn laborious, though, if you have many different paths and segments between the user(s) and the service(s).
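As a rough sketch of that comparison, the Python below takes two readings of a packet counter on each of two interfaces and works out how many packets went missing in between.  The function name and the sample numbers are invented for illustration, and the sketch assumes that both interfaces carry only the traffic of interest, that both were sampled at the same two moments, and that neither counter wrapped between samples.

```python
def packet_delta(up_before, up_after, down_before, down_after):
    """Compare packet counts on two interfaces over the same interval.

    Returns (forwarded, missing, loss_percent), where `forwarded` is the
    count that left the upstream interface and `missing` is how many of
    those did not show up on the downstream interface.
    """
    forwarded = up_after - up_before
    received = down_after - down_before
    missing = forwarded - received
    loss_percent = 100.0 * missing / forwarded if forwarded else 0.0
    return forwarded, missing, loss_percent


# Hypothetical counter readings taken before and after a test interval.
forwarded, missing, loss = packet_delta(1_000_000, 1_250_000,
                                        4_000_000, 4_249_500)
print(forwarded, missing, round(loss, 2))  # 250000 500 0.2
```

On a shared link the counters include everyone else's packets too, so treat a small difference as a hint to look closer, not as proof.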

The twist here (when not talking about audio and video services) is that a normal application’s manipulation of TCP is dynamic.  There are no set, fixed, measurable points or markings to compare.  When variables like server load, latency over the network, component and propagation delay, and client load are in play, the TCP FLAGS, packet size, TCP Window size, and a slew of other control mechanisms react to optimize the TCP connection as best as possible.  Some measuring and monitoring tools and command line utilities may provide a measurement of ReTX (retransmissions).  But a ReTX count on its own is not a reliable indicator of packet loss.  Often, ReTX is an indicator that either the client or the server is running too slowly or is overtaxed, so the sender ends up sending the last information again.

We got SACKed

So, by this time, you are probably saying “there has to be an easier way!”  Well, I am here to provide a few measurable metrics that reliably indicate that there has been packet loss for any application: SACKs (Selective Acknowledgements) and Duplicate ACKs (as in Acknowledgements from clients and servers that they have received the information from each other).

Let’s do a quick walk through of both of the above packet loss recovery indicators.  SACK is a very efficient TCP option that the client and server TCP stacks agree to use when the connection is set up.  If a Selective Acknowledgement is observed, within it will be information from the receiver of the packets about which blocks of data arrived beyond a gap, and therefore what actual data is MISSING from the last burst of packets.  This is almost ALWAYS an indicator of packet loss on the path from the sender to the receiver.  Another very effective TCP behavior to track and measure over time is the Duplicate ACK.  A receiver sends one when a packet arrives out of order: it cannot acknowledge past the gap, so it repeats the ACK it has already sent, carrying the same Acknowledgement number (the NEXT Sequence number it expects from the sender).  A run of Duplicate ACKs therefore also points to packet loss on the path toward the host that is sending them.  Why is directionality important?  If there are multiple paths that packets can take going to or coming from two entities on the network, it can become critically important to your triage and remediation efforts.
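To make the two indicators concrete, here is a minimal Python sketch that tallies them per direction.  Everything in it is an assumption made for illustration: the `Segment` record is a made-up, simplified stand-in for whatever your decode tool exports, and the Duplicate ACK test is deliberately naive (a segment with no payload that repeats the previous acknowledgement number in the same direction).  Real analyzers apply more conditions, for example ignoring window updates.

```python
from collections import Counter, namedtuple

# Hypothetical, simplified record of one decoded TCP segment.
Segment = namedtuple("Segment", "src dst ack payload_len sack_blocks")


def tally_loss_indicators(segments):
    """Count Duplicate ACKs and SACK-carrying segments per (src, dst)."""
    last_ack = {}
    dup_acks = Counter()
    sacks = Counter()
    for seg in segments:
        direction = (seg.src, seg.dst)
        if seg.sack_blocks:
            sacks[direction] += 1
        # Naive test: a pure ACK (no payload) that repeats the previous
        # acknowledgement number seen in this direction.
        if seg.payload_len == 0 and last_ack.get(direction) == seg.ack:
            dup_acks[direction] += 1
        last_ack[direction] = seg.ack
    return dup_acks, sacks


capture = [
    Segment("client", "server", 1001, 0, []),
    Segment("client", "server", 1001, 0, [(2461, 3921)]),
    Segment("client", "server", 1001, 0, [(2461, 5381)]),
    Segment("server", "client", 501, 1460, []),
    Segment("client", "server", 6841, 0, []),
]
dup_acks, sacks = tally_loss_indicators(capture)
print(dup_acks[("client", "server")], sacks[("client", "server")])  # 2 2
```

Keying the counts on the (source, destination) pair keeps track of which host emitted each indicator, which is the directional clue you need when more than one path is in play.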

So, to wrap this rather technical dissertation up, if you are fortunate enough to have access to tools that measure and/or monitor these occurrences for you, include a look at them early in your triage efforts.  If your current tool sets do not take advantage of these particular metrics, you can still build filters in most of the packet decode tools I have seen.  Those filters can provide a list of the packets that match one or both of these conditions.
