I recently gave a talk at AusNOG 2017 and given the presentation wasn’t recorded, I wanted to ensure it was documented. I had some very valuable feedback and insight from attendees following the session and I’ve attempted to address some of those points in this more detailed post. Accompanying slides for the talk are roughly followed in this post with a few edits.
I’ve spent a great deal of time over the past few years thinking about, building proofs of concept for and playing with various monitoring technologies. This post is a collection of my thoughts and ideas as I’ve gathered them. Hopefully they’re useful to you, and I’m more than happy to hear any thoughts and feedback you might have.
Why do we need monitoring?
This should be a simple question for most of us to answer but I want to outline some of the more high-level reasons so we can address them later on.
React to Failure Scenarios
The worst possible scenario for a network operator is to have a critical component fail and not know – an unknown unknown. For us as operators to deliver reliable services, it’s critical we have complete insight into every possible dependent component in our network that delivers our services.
To ensure we scale and build our networks, platforms and applications accordingly, sufficient information is required to predict long and short-term trends.
With the right data and approach to analysis, many emergent behaviours can be derived to give new insight into how our customers are using our applications and networks.
Drive Continual Change
Network monitoring should play a key role in reacting to unexpected outages and ensuring they don’t recur. As business goals and priorities change, monitoring should be an important tool to ensure that applications and networks meet their goals and customer expectations.
Small and medium ISPs and enterprises generally tend towards out-of-box or entry-level commercial software packages that promise visibility into a wide range of applications, devices and operating systems (PRTG, Zabbix, Solarwinds etc.). This is generally driven by limitations on time and resources, so a software package that provides auto-discovery, “magic” configuration and lower overhead is the path of least resistance. Unfortunately, these types of packages tend to suffer from a lack of focus when it comes to matching visibility with organisational goals. By trying to target many types of customer, software and hardware, these packages can create noise and false positives, generating too many automatic graphs and alerts that aren’t focused on ensuring higher-level goals are met.
Meanwhile, the internet giants like Facebook, Google and Netflix are building and open sourcing their own tools due to a need to collect, analyse and visualise data of unprecedented scale and complexity. A perfect example is the FBTracert utility, which is designed to identify network loss across complex ECMP fabrics. Unfortunately, this approach demands software development skills and resources, which these companies already have and can leverage. The benefit is tools and utilities that are custom built to inform the organisation about the quality of the product being delivered.
Limitations on the rest of us
Time and Resources
For obvious reasons, time and resources (be that money or people) are limited in small to medium ISPs and enterprises, which often means only the minimum time and effort is put towards monitoring. Sometimes the minimum is enough; other times, more effort results in greater reward for the business and its operators.
Network Device Compute Power
Until recently, operators have had limited on-box processing power that has placed limitations on the volume and resolution of metrics collected. This has limited the features that vendors could implement (such as Netflow vs sFlow) and limited the rate operators could poll metrics due to processing time on-box. Only in the last few years have we started to see devices with excess processing power and the functions that allow operators to use it.
Vendor Feature Support
We have recently seen vendors realise that visibility and insight are a huge area in which network devices have been lacking. Features such as streaming telemetry and detailed flow-based tracking are slowly becoming available but only on flagship products ($$$$$). New software always means bugs and it’s my opinion that certain vendors prefer to use their customer base as bug reporters rather than do the testing that’s required. Also, vendors will tend to implement features based on the market majority, of which smaller ISPs are usually not the target.
Previous Generation Devices
Unfortunately, not all of us have the budget to rip and replace our networks every 2-3 years to keep up with the product lifecycle of the major vendors. We all have previous (and N-2/3/4) generation hardware in active, production use. This places limitations on the features we can deploy uniformly across our networks, which unfortunately means plenty of SNMP and expect scripts.
Top Down Approach
I’ve spent a lot of time thinking about how to go about network monitoring in an efficient way. This is a method that I’ve implemented in a few different types of networks (ISPs, enterprises, live events) and I feel it’s a very focused way of gaining insight from network metrics.
The key idea to keep in mind when building a network monitoring system is to focus on the business goals – what does my business / event / network aim to achieve? For most ISPs, this will be a measure of quality or performance for their customers. It’s up to you as the system designer and business analyst to figure out what constitutes a good experience and design your monitoring to detect when that goal isn’t being met.
Due to the limitations mentioned above, it’s not possible (in most cases) to collect every possible metric from every device, so a subset must be carefully selected. Start with your business goal and decide which metrics are required to accurately measure the state of this goal. For example, say an ISP wishes to ensure that their subscribers get the best performance possible during peak usage time. This can be measured through the oversubscription ratio and uplink utilisation of the terminating device. The required metrics are: number of connected sessions, plus ifHCInOctets, ifHCOutOctets and ifHighSpeed for the uplink from the terminating device. Assuming a known average subscriber speed, we can calculate an oversubscription ratio and uplink utilisation (as a percentage, to normalise for varying uplink speeds). Repeat this process across all aspects of the network that might affect business goals to build a list of metrics and measurements.
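As a rough illustration of the calculation described above, here is a minimal Python sketch. It assumes two polls of the 64-bit octet counters taken some seconds apart and a link speed in Mbit/s (the unit IF-MIB’s ifHighSpeed reports). The class and function names, and the idea of multiplying sessions by an average subscriber speed, are my own illustrative assumptions rather than part of any particular monitoring product.

```python
from dataclasses import dataclass

COUNTER64_MAX = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


@dataclass
class UplinkSample:
    """One poll of the uplink counters (hypothetical container)."""
    timestamp: float       # seconds since epoch
    in_octets: int         # value of ifHCInOctets
    out_octets: int        # value of ifHCOutOctets


def counter_delta(previous: int, current: int) -> int:
    """Difference between two counter readings, tolerating a single wrap."""
    return (current - previous) % COUNTER64_MAX


def utilisation_percent(prev: UplinkSample, curr: UplinkSample,
                        speed_mbps: float) -> tuple:
    """Return (inbound %, outbound %) utilisation between two polls."""
    interval = curr.timestamp - prev.timestamp
    if interval <= 0 or speed_mbps <= 0:
        raise ValueError("need a positive poll interval and link speed")
    capacity_bps = speed_mbps * 1_000_000
    in_bps = counter_delta(prev.in_octets, curr.in_octets) * 8 / interval
    out_bps = counter_delta(prev.out_octets, curr.out_octets) * 8 / interval
    return (100 * in_bps / capacity_bps, 100 * out_bps / capacity_bps)


def oversubscription_ratio(sessions: int, avg_subscriber_mbps: float,
                           speed_mbps: float) -> float:
    """Sold bandwidth divided by uplink capacity (e.g. 10.0 means 10:1)."""
    if speed_mbps <= 0:
        raise ValueError("link speed must be positive")
    return sessions * avg_subscriber_mbps / speed_mbps
```

Expressing utilisation as a percentage of the reported link speed means the same alert threshold can be applied to 1G and 10G uplinks alike.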
With focused measurements calculated, business rules can be applied to create low-noise, useful and actionable alerts. There is nothing worse as an operations engineer than being woken up in the middle of the night for something that can wait until the next day. Build a policy that defines exactly how to classify alert priorities and then map an appropriate alerting method to each priority. For example, disk usage exceeding 80% doesn’t always justify an SMS and can easily be dealt with via a ticket or email to the appropriate team. Avoid too much noise from high priority alerts and save instant communication methods for potentially service-impacting conditions requiring immediate action.
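To make the idea of an alert policy concrete, here is a small hedged sketch in Python. The three priority levels, the two classification questions and the channel names are all assumptions chosen for illustration; your own policy will have its own criteria and notification methods.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical policy table: one notification method per priority.
CHANNEL_FOR_PRIORITY = {
    Priority.LOW: "ticket",
    Priority.MEDIUM: "email",
    Priority.HIGH: "sms",
}


@dataclass
class Alert:
    name: str
    service_impacting: bool    # are customers affected right now?
    needs_action_today: bool   # or can it wait for the ticket queue?


def classify(alert: Alert) -> Priority:
    """Apply the (assumed) business rules to pick a priority."""
    if alert.service_impacting:
        return Priority.HIGH
    if alert.needs_action_today:
        return Priority.MEDIUM
    return Priority.LOW


def route(alert: Alert) -> str:
    """Return the notification channel the policy assigns to this alert."""
    return CHANNEL_FOR_PRIORITY[classify(alert)]
```

Keeping the classification rules and the channel mapping separate means the on-call policy can change (say, swapping SMS for a paging service) without touching how alerts are classified.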
An effective monitoring system should be a key part of ensuring that your organisation is taking part in continual improvement. Whenever an unexpected outage or event occurs, go through the usual root cause investigation, then take the few extra steps to isolate the symptoms that indicate the issue and define alerts of an appropriate priority to prevent recurrence.
Monitoring for the rest of us
I wanted to keep this post at a high level and dive into the “How” in future posts but to give some guidance on how to go about this process of effective monitoring, I have a few recommendations.
Use the technology available to you
Most organisations will already have some sort of monitoring application in place (be it commercial or otherwise) and most platforms have enough knobs and dials to be coerced into behaving as I’ve described above. Take a step back, do the high level analysis of your business and network goals and then think about how you can use what you already have to structure an effective monitoring solution.
Design for Resilience
As an operations engineer, I am well aware of the “flying blind” feeling when monitoring stops working, and it is not a comfortable one. Design your monitoring system to be as resilient as possible against network and underlying system outages. Even if you can’t get to monitoring data during an outage, ensure that metrics are preserved for later root cause analysis.
This can be achieved by decentralising the collection mechanisms and placing collectors as close as possible to the data sources, which reduces the number of dependent systems and devices that could affect the metric collection process. Storing information centrally is key to doing effective analysis, so a means of bringing decentralised collector data into a central database is important. Queues and buffers are an excellent way to ensure persistence if the connection between a collector and the central store is disrupted. Many software platforms have support for Apache ActiveMQ, Kafka or other internal buffering mechanisms.
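The buffering idea can be sketched in a few lines of Python. This is only an in-memory illustration under my own assumptions: the `BufferedForwarder` class and its `send` callback are hypothetical, and a real collector would more likely spool to disk or hand off to a message broker so that metrics also survive a restart.

```python
from collections import deque
from typing import Callable, Deque, Dict

Metric = Dict[str, object]


class BufferedForwarder:
    """Hold metrics locally until the central store accepts them."""

    def __init__(self, send: Callable[[Metric], None],
                 max_buffered: int = 10_000) -> None:
        # `send` is a hypothetical callback that ships one metric to the
        # central store and raises ConnectionError when it is unreachable.
        self._send = send
        # A bounded deque silently drops the OLDEST entry once full, so
        # size it for the longest outage you want to ride out.
        self._buffer: Deque[Metric] = deque(maxlen=max_buffered)

    def submit(self, metric: Metric) -> None:
        """Queue a metric, then try to drain the buffer."""
        self._buffer.append(metric)
        self.flush()

    def flush(self) -> int:
        """Send buffered metrics in order; stop at the first failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # keep the metric and retry on the next flush
            self._buffer.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        """Number of metrics still waiting to be delivered."""
        return len(self._buffer)
```

A metric is only removed from the buffer after `send` returns without error, so a dropped connection mid-flush delays delivery rather than losing data.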
As mentioned previously, limitations on current software and hardware prevent operators from collecting every single statistic possible, however if you can store more than the minimum, then do so for future analysis. Ensure that any raw metrics used for further calculations are also stored as this can be key in doing historical analysis on new trends.
Meta monitoring is important. Verify that your monitoring system is operating as expected by checking that there is current data, services are in the correct state and network connections exist on expected ports. There is nothing worse than having a quiet day in operations only to realise that monitoring hasn’t been working correctly.
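Two of these meta-monitoring checks are simple enough to sketch in Python using only the standard library. The function names and thresholds are illustrative assumptions; the timestamp of the most recent sample would come from whatever datastore you use.

```python
import socket
import time
from typing import Optional


def is_fresh(last_sample_ts: float, max_age_s: float,
             now: Optional[float] = None) -> bool:
    """True if the newest stored sample is no older than max_age_s seconds."""
    now = time.time() if now is None else now
    return (now - last_sample_ts) <= max_age_s


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

Ideally these checks run from somewhere independent of the monitoring system itself, so that a failure of the monitoring host doesn’t also silence the check that would have reported it.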
I am really passionate about monitoring and I intend to go into further technical detail with reference to specific technologies in future posts. I presented this talk to help me construct and distill my ideas into something cohesive and I feel that I’ve learnt a lot about my own processes as I went.
I’m lucky in that my current employer allows me a lot of freedom to be innovative and build an effective monitoring system to drive continual improvement in the organisation. Anything specific to this project will be published on our blog, the first post of which is here: Metrics @ IX Australia.