Network Monitoring @ RFLAN 54

This blog post was originally published on the RFLAN – Redflag Lanfest Facebook page and has been re-syndicated here with only opening remarks added.

About a year ago I joined the Infrastructure Team at RFLAN to help run what we believe is the largest LAN party in Australia. This brought many technical and logistical challenges that I have really enjoyed. Below is an overview of the monitoring infrastructure the team and I built for the last RFLAN event.

After laying down many kilometres of cable and configuring services and monitoring infrastructure, RFLAN 54 was set to go. Valuable feedback from the Lanners in the post-event survey had identified that we needed better monitoring and insight into the event network. We believed that broadcast storms caused by layer 2 switching loops were the cause of poor network performance, and set about creating detailed monitoring systems to home in on the root cause.

The team chose CollectD for polling SNMP values from network devices (core and table switches) and for querying other useful information from the wireless controller, the DHCP and DNS servers, and the game servers. CollectD pushes metrics into a time series database, InfluxDB, which backs the beautiful dashboards created in Grafana.

Flow chart showing the collection of monitoring tools and how they are linked together.
Here is the flow of data as it moves through our monitoring stack consisting of CollectD, InfluxDB and Grafana.
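For the curious, the shape of what CollectD hands to InfluxDB can be sketched without either tool: InfluxDB's HTTP write endpoint accepts plain-text "line protocol" points. A minimal sketch of formatting one point; the measurement, tag and field names here are illustrative, not our production schema.

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Render one InfluxDB line-protocol point:
    measurement,tag=v,... field=v,... timestamp(ns).
    Integer field values carry an 'i' suffix in line protocol.
    """
    def fmt_field(v):
        return f"{v}i" if isinstance(v, int) else str(v)

    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={fmt_field(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Hypothetical sample: a broadcast counter from one table switch port.
point = to_line_protocol(
    "snmp_if",
    {"switch": "table-12", "ifIndex": "24"},
    {"ifInBroadcastPkts": 1024},
    1_500_000_000_000_000_000,
)
```

One such line per metric, POSTed in batches, is all the write path really is; Grafana then queries InfluxDB by measurement and tag.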

The key lesson we learnt from previous events was to know in advance which problems we wanted to avoid, and to collect the metrics needed to diagnose them when they do occur. RFLAN runs a large, single layer 2 broadcast domain with up to 800 hosts sharing the same VLAN; this is required by the nature of game server auto-discovery. This configuration can cause widespread disruption when layer 2 loops occur, such as when Lanners bring their own network devices that have not been configured correctly. These devices are also invisible to the network admins, which makes diagnosing issues take much longer. To identify the source of network loops, we collected all broadcast, unicast and interface counters and crafted dashboards that would show us exactly when and where a loop was occurring. Unfortunately, due to limitations of the desk switches we use, we can only narrow an issue down to a single table switch, not to a specific set of ports.
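The counter maths behind those dashboards is simple enough to sketch. SNMP interface counters are monotonic and eventually wrap, so a rate is derived from two successive samples. A minimal sketch; the 32-bit counter size and 10-second polling interval are illustrative:

```python
COUNTER32_MAX = 2**32  # 32-bit SNMP counters roll over at this value

def pps(prev, curr, interval_s, counter_max=COUNTER32_MAX):
    """Packets/sec from two successive counter readings,
    tolerating a single counter wrap between polls."""
    delta = curr - prev
    if delta < 0:  # counter wrapped since the last poll
        delta += counter_max
    return delta / interval_s

# 20,000 pkt/s of broadcasts over a 10 s polling interval
rate = pps(100_000, 300_000, 10)
```

With the per-port rates in InfluxDB, the "when and where" of a loop is just a Grafana panel grouped by switch and interface.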

During the event, the Network Operations Centre (NOC) was notified of Lanners receiving repeated disconnects from gaming sessions, a classic sign of a network disruption. Inspection of the graphs identified a single table switch flooding in excess of 20,000 packets/sec of broadcast traffic, which was then replicated to all other table switches, creating a broadcast storm. After shutting down the identified table switch, admins slowly brought each port back online until the looping Lanner was identified. Once the foreign device connected to a table switch port was removed, the network returned to normal and operated as expected.

Here is the dashboard we created that pinpoints the location of the broadcast storm, based on the number of broadcast packets received from table switches.

A graph showing the link bundles with their received broadcasts statistics.
All the link bundles showing their received number of broadcast packets per second.
A graph showing a large peak on a single link bundle indicating a high rate of received broadcasts.
The identified bundle narrowed down to the time of the network disruptions.
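The triage step itself boils down to finding the outlier among the link bundles. A toy sketch, assuming a dict of per-table-switch broadcast rates; the names and the 20,000 pkt/s threshold are illustrative, taken from the incident above:

```python
def find_storm_sources(rates_pps, threshold=20_000):
    """Return the uplinks whose received broadcast rate meets or
    exceeds the storm threshold, sorted for stable display."""
    return sorted(name for name, rate in rates_pps.items() if rate >= threshold)

# Hypothetical snapshot of received-broadcast rates per table switch uplink
rates = {"table-01": 120, "table-07": 23_500, "table-12": 90}
suspects = find_storm_sources(rates)
```

In practice this is exactly what the Grafana panel does visually: every bundle is flat except the one that isn't.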

Below are a few of the awesome dashboards the admins created before and during the event to provide heads-up information across all the critical services of the event.

Internet Bandwidth

A graph showing internet bandwidth across the 24 hours of the event.
Internet connectivity during the event. We topped out at 6.4 Gbit/sec just after midday on the Saturday.

DNS Infrastructure

A Grafana Dashboard with DNS statistics.
We use an application called dnsdist that allows us to load-balance across upstream servers and pull all of these useful statistics.
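dnsdist exposes its counters through its console and built-in web interface; however you fetch them, derived metrics for the dashboard are easy to compute. A hypothetical sketch, assuming a flat dict of counters with keys named `queries` and `servfail-responses` (assumed names, not necessarily our exact schema):

```python
def servfail_ratio(stats):
    """Fraction of queries answered SERVFAIL; 0.0 before any queries
    arrive, to avoid dividing by zero on a fresh counter set."""
    queries = stats.get("queries", 0)
    if not queries:
        return 0.0
    return stats.get("servfail-responses", 0) / queries

ratio = servfail_ratio({"queries": 1000, "servfail-responses": 25})
```

A ratio like this makes a better single-stat panel than raw counters, since it stays comparable as query volume rises and falls over the event.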

Wireless Infrastructure

A Grafana dashboard showing wireless statistics and gauges.
Pulling statistics from the Ubiquiti WiFi controller with custom-built scripts.
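Those custom scripts aren't reproduced here, but the aggregation step can be sketched. Assuming the controller's station list decodes to a JSON list of client records each carrying an `ap_mac` field (an assumption about the response shape, not a documented contract), counting clients per access point might look like:

```python
from collections import Counter

def clients_per_ap(stations):
    """Count associated wireless clients per AP MAC address.

    `stations` is assumed to be the decoded JSON list from a
    controller station-list endpoint; only `ap_mac` is used here.
    """
    return Counter(s["ap_mac"] for s in stations if "ap_mac" in s)

# Hypothetical sample of three associated clients across two APs
sample = [
    {"ap_mac": "aa:bb:cc:00:00:01", "hostname": "lanner-pc"},
    {"ap_mac": "aa:bb:cc:00:00:01", "hostname": "phone"},
    {"ap_mac": "aa:bb:cc:00:00:02", "hostname": "laptop"},
]
counts = clients_per_ap(sample)
```

Each per-AP count then becomes one gauge on the wireless dashboard.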

DHCP Leases

A smooth curved graph showing active DHCP leases during the event.
Count of DHCP leases active during the event.
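The lease count itself can be derived straight from the server's state. As an illustrative sketch assuming ISC dhcpd (whose `dhcpd.leases` file records a `binding state` per lease block, appending a new block for every lease event), a minimal parser counting active leases:

```python
def count_active_leases(text):
    """Count active leases in ISC dhcpd.leases content.

    dhcpd appends a new block per lease event, so the last block
    seen for a given IP is the authoritative state. Lines like
    'next binding state free;' are deliberately skipped.
    """
    state_by_ip, ip = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("lease ") and line.endswith("{"):
            ip = line.split()[1]
        elif line.startswith("binding state") and ip:
            state_by_ip[ip] = line.split()[2].rstrip(";")
    return sum(1 for s in state_by_ip.values() if s == "active")

# Hypothetical excerpt: .5 stays active, .6 is granted then released
SAMPLE = """\
lease 10.20.0.5 {
  binding state active;
  next binding state free;
}
lease 10.20.0.6 {
  binding state active;
}
lease 10.20.0.6 {
  binding state free;
}
"""
active = count_active_leases(SAMPLE)
```

Polling this on an interval and pushing the count into InfluxDB is enough to produce the smooth curve above.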

NOC Dashboard

A dashboard showing an overall view of all event systems.
This was the main dashboard the senior network admins used to monitor the overall health of the network. This was run on a 4K TV driven by an Intel NUC SFF PC.