Switching Archives - Tim Raphael

Switching

27 July, 2016

Network Monitoring @ RFLAN 54

Tim Network Monitoring, RFLAN, Switching 2 Comments

This blog post was originally published on the RFLAN – Redflag Lanfest Facebook page and has been re-syndicated here with only opening remarks added.

About a year ago I joined up with the Infrastructure Team at RFLAN to assist in running what we believe is the largest LAN Party in Australia. This has brought many technical and logistical challenges, which I have really enjoyed. Below is an overview of the monitoring infrastructure the team and I created for the last RFLAN event.

After many kilometres of cable had been laid and the services and monitoring infrastructure configured, RFLAN 54 was set to go. Valuable feedback from the Lanners through the post-event survey identified that we needed better monitoring and insight into the event network. We believed that broadcast storms caused by layer 2 switching loops were the cause of poor network performance and set about creating detailed monitoring systems to home in on the root cause.

The team chose CollectD to poll SNMP values from network devices (core and table switches) and to query other useful information from the wireless controller, the DHCP and DNS servers, and the game servers. CollectD pushes metrics into a time series database called InfluxDB, which feeds beautiful dashboards created in Grafana.

Flow chart showing the collection of monitoring tools and how they are linked together. — *Here is the flow of data as it moves through our monitoring stack consisting of CollectD, InfluxDB and Grafana.*

The key lesson we learnt from previous events was to know which problems we wanted to avoid and to collect metrics that aid diagnosis when they do occur. RFLAN runs a large, single layer 2 broadcast domain with up to 800 hosts sharing the same VLAN; this is due to the nature of game server auto-discovery. This configuration can cause widespread disruption when layer 2 loops occur, such as when lanners bring their own network devices that have not been configured correctly. Additionally, these devices are not visible to the network admins, which makes diagnosing issues take much longer. To identify the source of network loops, we collected all broadcast, unicast and interface counters and crafted dashboards that would show us exactly when and where a loop was occurring. Unfortunately, due to limitations of the desk switches we use, we can only narrow the issue down to a single table switch and not to a specific set of ports.
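As a rough illustration of how cumulative broadcast counters become a "where is the loop" signal, the sketch below turns two polls of a counter into a packets-per-second rate and flags any uplink above a threshold. This is not the team's actual CollectD or Grafana configuration: the switch names, counter values, 10 second poll interval and 10,000 pps threshold are all made up for the example, and it assumes a 64-bit counter such as the standard IF-MIB `ifHCInBroadcastPkts`.

```python
# Hypothetical sketch: derive broadcast packets/sec per table switch uplink
# from two polls of a cumulative SNMP counter, then flag likely storm sources.

COUNTER_MAX = 2**64  # assumes 64-bit counters, which wrap at this value


def rate_per_second(previous, current, interval_seconds, counter_max=COUNTER_MAX):
    """Convert two cumulative counter readings into a per-second rate."""
    delta = current - previous
    if delta < 0:  # the counter wrapped between the two polls
        delta += counter_max
    return delta / interval_seconds


def find_storm_sources(previous_poll, current_poll, interval_seconds, threshold_pps):
    """Return (uplink, pps) pairs above the threshold, busiest first."""
    rates = {
        uplink: rate_per_second(previous_poll[uplink], current_poll[uplink], interval_seconds)
        for uplink in current_poll
        if uplink in previous_poll
    }
    suspects = [(uplink, pps) for uplink, pps in rates.items() if pps > threshold_pps]
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Made-up received-broadcast counters for three uplinks, polled 10 seconds apart.
previous_poll = {"table-01": 1_000_000, "table-02": 5_000_000, "table-03": 2_000_000}
current_poll = {"table-01": 1_001_500, "table-02": 5_250_000, "table-03": 2_000_900}

suspects = find_storm_sources(previous_poll, current_poll, interval_seconds=10, threshold_pps=10_000)
print(suspects)
```

With these invented numbers only `table-02` crosses the threshold, which matches the limitation described above: the counters point at a table switch, not at the individual port behind it.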

During the event, the Network Operations Centre (NOC) was notified of Lanners receiving multiple disconnects from gaming sessions, a classic sign of a network disruption. Upon inspection of the graphs, a single table switch was identified as flooding in excess of 20,000 packets/sec of broadcast traffic, which was then replicated to all other table switches, creating a broadcast storm. After shutting down the identified table switch, admins slowly began the process of bringing each port online until the looping lanner was identified. Upon removal of a foreign device connected to a table switch port, the network returned to normal and operated as expected.

Here is the dashboard we created that highlights exactly the location of the broadcast storm based on the number of received broadcast packets from table switches.

A graph showing the link bundles with their received broadcasts statistics. — *All the link bundles showing their received number of broadcast packets per second.*

A graph showing a large peak on a single link bundle indicating a high rate of received broadcasts. — *The identified bundle narrowed down to the time of the network disruptions.*

Below are a few of the awesome dashboards the admins created before and during the event to provide heads-up information across all the critical services of the event.

Internet Bandwidth

DNS Infrastructure

A Grafana Dashboard with DNS statistics. — *Using an application called DNSDist that allows us to load-balance upstream servers and pull all of these useful statistics.*

Wireless Infrastructure

A Grafana dashboard showing wireless statistics and gauges. — *Pulling statistics from the Ubiquiti Wi-Fi controller with custom-built scripts.*

DHCP Leases

NOC Dashboard

A dashboard showing an overall view of all event systems. — This was the main dashboard the senior network admins used to monitor the overall health of the network. This was run on a 4K TV driven by an Intel NUC SFF PC.

13 February, 2015

Investigating Micro-Bursting

Tim Switching, Wireshark 0 Comments

As modern data networks get faster and faster, we’re starting to see the bounds of Internet traffic demands grow at an alarming rate. This phenomenon has put a lot of pressure on network engineers to monitor their network traffic levels and scale capacity accordingly. We tend to rely on our capacity monitoring tools to let us know exactly how utilized our network is and when we should expect to grow – this is all well and good unless our tools don’t paint an entirely accurate picture.

Most SNMP monitoring tools poll the input and output octet counters of a device at a pre-defined time interval and then calculate an average over that period of time. For example, a device that switches 120MB in 60 seconds has an average data rate of 16Mbps. This average doesn’t take into account the fact that the interface might have done 100MB in the first 10 seconds (80Mbps) and the other 20MB in the remaining 50 seconds (3.2Mbps). You can see from this example that an average of 16Mbps isn’t even close to an accurate picture of interface utilisation.
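The arithmetic in that example can be checked with a few lines of Python. This is only a worked calculation, treating 1MB as 8 megabits and taking the quiet period as the 50 seconds left in the minute after the 10 second burst; the function name is my own.

```python
def mbps(megabytes, seconds):
    """Average rate in megabits per second for `megabytes` moved in `seconds`."""
    return megabytes * 8 / seconds


whole_minute = mbps(120, 60)    # what a 60 second poll would report
first_10s = mbps(100, 10)       # the busy start of the minute
remaining_50s = mbps(20, 50)    # the quiet remainder of the minute

print(f"60s average: {whole_minute:.1f} Mbps")
print(f"first 10s:   {first_10s:.1f} Mbps")
print(f"last 50s:    {remaining_50s:.1f} Mbps")
```

The single 60 second figure sits between the two real rates and describes neither of them, which is exactly how a short burst disappears inside a long polling interval.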

A microburst is simply a very short period in which enough traffic moves through an interface to hit its upper limit, causing packets to be dropped. The time period is often too short to be detected by the usual array of network monitoring tools.

Recently my monitoring tools highlighted an uplink trunk interface that had an abnormal number of interface discards. Only after a lot of reading and troubleshooting did I become interested in finding out whether micro-bursting might be the culprit.

A graph showing 30 second averages of the gigabit interface showing discards during peak times.

To find out whether micro-bursting was to blame, and if so what was causing it, I broke out a high powered packet capturing server loaded with Ubuntu, tcpdump and Wireshark. The uplink trunk in question was running (according to my monitoring tools) at an average of 200-300Mbps in each direction during normal business hours, so I needed the CPU and disk IO to keep up with a minimum of ~60MB/sec and a potential maximum of 250MB/sec (1Gbit each way) for many minutes to try and capture a microburst in action.
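For anyone sizing a capture box the same way, the disk figures above follow from a simple conversion: add both directions together and divide by 8 to go from megabits to megabytes. The sketch below is just that calculation; the function name is my own, and it ignores the small per-packet overhead a capture file adds on top of the raw traffic.

```python
def capture_mb_per_sec(mbps_each_direction):
    """Approximate disk write rate (MB/s) needed to capture both directions of a link."""
    both_directions_mbps = mbps_each_direction * 2
    return both_directions_mbps / 8  # 8 bits per byte


typical_low = capture_mb_per_sec(200)    # lower end of the business-hours average
typical_high = capture_mb_per_sec(300)   # upper end of the business-hours average
worst_case = capture_mb_per_sec(1000)    # gigabit link saturated in both directions

print(typical_low, typical_high, worst_case)
```

The ~60MB/sec minimum sits inside the 200-300Mbps range, and the 250MB/sec ceiling is the fully saturated case.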

After taking multiple 4GB+ packet captures from a SPAN (mirror) port with tcpdump, I was able to load the results into Wireshark for analysis. As each packet is captured from the SPAN port, it is time-stamped very accurately to allow us to find periods of high utilization. After waiting a while (at least two coffees) for the capture file to load within Wireshark, select “IO Graph” from the Statistics menu to generate a graph of packets, bytes or bits per second based on the timestamps accompanying the packets.

Adjust the Tick Interval and let Wireshark recalculate the graph down to that time interval.

Wireshark allows you to change the time period of each graph interval down to 1/10th, 1/100th or even 1/1000th of a second to see if you are experiencing microbursts. In my case, I was able to find smaller periods of time where traffic was saturating the 1Gbps uplink.

A graph generated by Wireshark showing micro-bursting activity. — The above graph shows the interface hitting 1Gbps and flat lining for about 2/10ths of a second, a micro-burst in action.
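The same idea can be sketched outside Wireshark: group packets into fixed-width time buckets by timestamp, scale each bucket to bits per second, and look for buckets at or near line rate. The code below is not how Wireshark implements its graph and does not read a real capture. It uses synthetic traffic (a 100Mbps background with a 0.2 second burst at 1Gbps), and the function names and the 95% saturation cut-off are assumptions for illustration. Timestamps are integer microseconds to avoid floating point rounding at bucket edges.

```python
from collections import defaultdict


def bits_per_second_by_bucket(packets, bucket_us):
    """Sum packet sizes into fixed-width buckets and scale each bucket to bits/sec.

    `packets` is an iterable of (timestamp_in_microseconds, length_in_bytes).
    Returns {bucket_start_us: bits_per_second}.
    """
    totals = defaultdict(int)
    for timestamp_us, length_bytes in packets:
        totals[timestamp_us // bucket_us] += length_bytes
    return {
        bucket * bucket_us: total_bytes * 8 * 1_000_000 / bucket_us
        for bucket, total_bytes in sorted(totals.items())
    }


def find_bursts(packets, bucket_us, link_bps, saturation=0.95):
    """Return the start times (in microseconds) of buckets at or near line rate."""
    rates = bits_per_second_by_bucket(packets, bucket_us)
    return [start for start, bps in rates.items() if bps >= link_bps * saturation]


# Synthetic one-second capture of 1500-byte packets:
# one packet every 120 us is 100Mbps; one every 12 us is 1Gbps.
background = [(t, 1500) for t in range(0, 1_000_000, 120) if not 400_000 <= t < 600_000]
burst = [(t, 1500) for t in range(400_000, 600_000, 12)]
packets = sorted(background + burst)

GIGABIT = 1_000_000_000
one_second_view = find_bursts(packets, bucket_us=1_000_000, link_bps=GIGABIT)
tenth_second_view = find_bursts(packets, bucket_us=100_000, link_bps=GIGABIT)

print("1s buckets flagged:  ", one_second_view)
print("0.1s buckets flagged:", tenth_second_view)
```

At one second resolution nothing is flagged because the burst is averaged away, while the 1/10th second buckets starting at 0.4s and 0.5s show the link pinned at line rate.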

Using the x-axis of the graph as a time indicator, I was able to look through the contents of the capture and determine that my caching farm was actually pushing out 1Gbps to some client requests and thus causing the interface drops.

Lesson learnt: use your monitoring tools to gauge network load and predict capacity but don’t take them as gospel when long averages are involved.