Investigating Micro-Bursting

As modern data networks get faster and faster, we're starting to see Internet traffic demands grow at an alarming rate. This puts a lot of pressure on network engineers to monitor their traffic levels and scale capacity accordingly. We tend to rely on our capacity monitoring tools to tell us exactly how utilised our network is and when we should expect to grow – which is all well and good, unless those tools don't paint an entirely accurate picture.

Most SNMP monitoring tools poll the input and output octet counters of a device at a pre-defined interval and then calculate an average over that period. For example, a device that switches 120 MB in 60 seconds shows an average data rate of 16 Mbps. That average doesn't take into account the fact that the interface might have passed 100 MB in the first 10 seconds (80 Mbps) and the remaining 20 MB over the other 50 seconds (3.2 Mbps). You can see from this example that an average of 16 Mbps isn't even close to an accurate picture of the interface's utilisation.
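
To make that gap concrete, here's a quick back-of-the-envelope sketch in plain Python using the same numbers as the example above – nothing more than the arithmetic written out:

```python
# Numbers from the example above: 120 MB switched during a 60-second poll,
# 100 MB of it in the first 10 seconds and the remaining 20 MB over 50 seconds.
poll_interval = 60                       # seconds between SNMP counter polls
total_bytes   = 120 * 10**6

avg_mbps   = total_bytes * 8 / poll_interval / 10**6   # what the NMS graphs
burst_mbps = 100 * 10**6 * 8 / 10 / 10**6              # first 10 seconds
quiet_mbps = 20 * 10**6 * 8 / 50 / 10**6               # remaining 50 seconds

print(avg_mbps, burst_mbps, quiet_mbps)                # 16.0 80.0 3.2
```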

A microburst is simply a very short period during which enough traffic moves through an interface to hit its upper limit, causing packets to be dropped. The period is often too short to be detected by the usual array of network monitoring tools.

Recently my monitoring tools highlighted an uplink trunk interface with an abnormal number of interface discards. Only after a lot of reading and troubleshooting did I start to wonder whether micro-bursting might be the culprit.

A graph of 30-second averages on the gigabit interface, showing discards during peak times.

To find out whether micro-bursting was to blame – and, if so, what was causing it – I broke out a high-powered packet-capture server loaded with Ubuntu, tcpdump and Wireshark. The uplink trunk in question was running (according to my monitoring tools) at an average of 200-300 Mbps in each direction during normal business hours, so I needed the CPU and disk I/O to keep up with a minimum of ~60 MB/sec and a potential maximum of 250 MB/sec (1 Gbps in each direction) for many minutes at a time to catch a microburst in action.
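
Sizing the capture box is just a bit of arithmetic, sketched below as plain Python. The 25% average utilisation figure is simply my reading of the 200-300 Mbps averages mentioned above, and it assumes the SPAN session mirrors both directions of the uplink onto a single capture interface:

```python
# Rough sizing for the capture server (assumption: the SPAN session mirrors
# both directions of the 1 Gbps uplink, so worst case is 2 Gbps to disk).
link_bps   = 1_000_000_000
directions = 2

avg_util   = 0.25   # ~250 Mbps each way during business hours
worst_case = 1.0    # both directions running flat out at line rate

avg_disk  = link_bps * directions * avg_util / 8 / 10**6    # ~62 MB/sec
peak_disk = link_bps * directions * worst_case / 8 / 10**6  # 250 MB/sec
print(round(avg_disk), round(peak_disk))                    # 62 250
```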

Multiple 4 GB+ packet captures later, taken from a SPAN (mirror) port with tcpdump, I was able to load the results into Wireshark for analysis. As each packet is captured from the SPAN port it is time-stamped very accurately, which lets us find periods of high utilisation. After waiting a while (at least two coffees) for the capture file to load, select Wireshark's IO Graph (under the Statistics menu) to generate a graph of packets, bytes or bits per second based on the timestamps accompanying the packets.

Adjust the Tick Interval and let Wireshark recalculate the graph down to that time interval.

Wireshark allows you to change the time period of each graph interval down to 1/10th, 1/100th or even 1/1000th of a second to see whether you are experiencing microbursts. In my case, I was able to find short periods of time where traffic was saturating the 1 Gbps uplink.
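
The same sub-second binning can be scripted when a capture is too large to work with comfortably in the GUI. Below is a rough sketch in plain Python – not part of the original toolkit, just one way to do it – that assumes a classic libpcap file with microsecond timestamps (tcpdump's default output format). It buckets the on-the-wire length of every packet into 1 ms intervals and flags any bucket running at more than 90% of the 1 Gbps line rate:

```python
#!/usr/bin/env python3
"""Bucket a classic libpcap capture into 1 ms intervals and flag saturation."""
import struct
import sys
from collections import defaultdict

BUCKET   = 0.001           # bucket width in seconds (1 ms)
LINK_BPS = 1_000_000_000   # the 1 Gbps uplink in question

def iter_records(path):
    """Yield (timestamp, wire_length) for each packet record in the file."""
    with open(path, 'rb') as f:
        magic = struct.unpack('<I', f.read(24)[:4])[0]
        if magic == 0xa1b2c3d4:      # little-endian, microsecond timestamps
            endian = '<'
        elif magic == 0xd4c3b2a1:    # big-endian, microsecond timestamps
            endian = '>'
        else:
            sys.exit('not a classic microsecond-timestamp pcap file')
        while True:
            hdr = f.read(16)
            if len(hdr) < 16:
                break
            ts_sec, ts_usec, incl_len, orig_len = struct.unpack(endian + 'IIII', hdr)
            f.seek(incl_len, 1)      # skip the packet bytes; only sizes matter here
            yield ts_sec + ts_usec / 1e6, orig_len

buckets = defaultdict(int)
for ts, wire_len in iter_records(sys.argv[1]):
    buckets[int(ts / BUCKET)] += wire_len

base = min(buckets, default=0)
for b in sorted(buckets):
    bps = buckets[b] * 8 / BUCKET                  # bits per second in this bucket
    if bps > 0.9 * LINK_BPS:                       # within 10% of line rate
        print(f'{(b - base) * BUCKET:10.3f}s  {bps / 1e6:6.0f} Mbps')
```

Run it against one of the capture files (e.g. python3 microburst.py uplink.pcap – filenames hypothetical); shrinking BUCKET trades memory and run time for resolution, exactly as the tick interval does in the GUI.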

A graph generated by Wireshark showing micro-bursting activity.
The above graph shows the interface hitting 1 Gbps and flatlining for about two tenths of a second – a micro-burst in action.

Using the x-axis of the graph as a time indicator, I was able to look through the contents of the capture and determine that my caching farm was pushing out a full 1 Gbps in response to some client requests, causing the interface drops.
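
That per-source attribution can also be scripted rather than eyeballed. The sketch below is one way to do it, this time leaning on the third-party dpkt library (an assumption on my part – it isn't part of the toolkit described above). Given a burst window read off the x-axis of the IO graph, it totals bytes by source IP so the top talkers during the burst stand out:

```python
import socket
import sys
from collections import Counter

import dpkt   # third-party: pip install dpkt (assumed available)

# Burst window in seconds, measured from the start of the capture. These are
# placeholder values - read the real ones off the x-axis of the IO graph.
BURST_START, BURST_END = 812.4, 812.6

talkers = Counter()
with open(sys.argv[1], 'rb') as f:
    first_ts = None
    for ts, buf in dpkt.pcap.Reader(f):
        first_ts = ts if first_ts is None else first_ts
        rel = ts - first_ts
        if not BURST_START <= rel <= BURST_END:
            continue
        eth = dpkt.ethernet.Ethernet(buf)
        if isinstance(eth.data, dpkt.ip.IP):
            # captured length; matches the wire length when tcpdump ran with -s 0
            talkers[socket.inet_ntoa(eth.data.src)] += len(buf)

for src, nbytes in talkers.most_common(10):
    print(f'{src:15s} {nbytes / 1e6:6.1f} MB')
```

The 0.2-second window here simply matches the flatline visible in the graph above.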

Lesson learnt: use your monitoring tools to gauge network load and predict capacity, but don't take them as gospel when long averages are involved.