About a year ago I joined up with the Infrastructure Team at RFLAN to assist in running what we believe is the largest LAN Party in Australia. With this brought many technical and logistical challenges I have really enjoyed. Below is an overview of the monitoring infrastructure the team and I created for the last RFLAN event.
After laying down many kilometres of cable, configuring services and monitoring infrastructure, RFLAN 54 was set to go. Based on the valuable feedback from the Lanners through the post-event survey it was identified that we needed better monitoring and insight into the event network. We believed that broadcast storms caused by layer 2 switching loops were the cause of poor network performance and set about creating detailed monitoring systems to hone in on the root cause.
The team chose a combination of CollectD for polling SNMP values from network devices (core and table switches) as well as querying other useful information from the wireless controller, DHCP and DNS servers as well as game servers. CollectD pushes metrics into a time series database called Influx which allows display with beautiful dashboards created in Grafana.
The key lesson we learnt from previous events was to ensure we knew which problems we want to avoid and to collect metrics to aid diagnosis when they do occur. RFLan runs with a large, single layer 2 broadcast domain with up to 800 hosts sharing the same VLAN; this is due to the nature of game server auto-discovery. This configuration can cause widespread disruptions when layer 2 loops occur; such as when lanners bring their own network devices that have not be configured correctly. Additionally, these devices are not visible to the network admins and thus causes the job of diagnosing issues to take much longer. For identifying the source of network loops, we collected all broadcast, unicast and interface counters and crafted dashboards that would show us exactly when and where a loop was occurring. Unfortunately, due to limitations of the desk switches we use, we can only narrow the issue down to a single table switch and not to a set of ports directly.
During the event, the Network Operations Centre (NOC) was notified of Lanners receiving multiple disconnects from gaming sessions, a classic sign of a network disruption. Upon inspection of the graphs a single table switch was identified to be flooding in excess of 20,000 packets/sec of broadcast traffic which was then replicated to all other table switches creating a broadcast storm. After shutting down the identified table switch, admins slowly began the process of bringing each port online until the looping lanner was identified. Upon removal of a foreign device connected to a table switch port, the network returned to normal and operated as expected.Here is the dashboard we created that highlights exactly the location of the broadcast storm based on the number of received broadcast packets from table switches.
Below are a few of the awesome dashboards the admins created before and during the event to provide heads-up information across all the critical services of the event.