Musings on Network Monitoring
I’ve recently been spending some time looking into the world of network monitoring. A point that comes up again and again in my reading is that a strong network monitoring strategy is key to staying on top of your infrastructure. After much reading, investigating and testing of the various monitoring platforms out there, I’ve come to the conclusion that no single application or suite suits my needs. I did, however, find that network monitoring falls into four rough categories: alerting, performance graphing, log management and device management.
Priority one on my list of features for an NMS is alerting: the ability to detect a change in device state or to trigger at some threshold. Alerting allows an organisation to run a 24/7 operation without paying for a round-the-clock engineering team. A poorly configured alerting system, on the other hand, can trigger false positives that may hide actual underlying problems. A good NMS should allow significant control over how and when alerts are raised, but shouldn’t be so complex that it takes an inordinate amount of time to configure correctly.
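To make the false-positive point concrete, here is a minimal sketch of one common damping approach: only raise an alert after several consecutive failed checks, and send a single recovery notice when the device comes back. The class name `FlapDamper`, the `threshold` parameter and the sample check results are my own illustrative assumptions, not the behaviour of any particular NMS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlapDamper:
    """Raise an alert only after `threshold` consecutive failed checks."""
    threshold: int = 3      # hypothetical default, tune per device
    bad_streak: int = 0     # failed checks seen in a row
    alerting: bool = False  # True while an alert is outstanding

    def observe(self, ok: bool) -> Optional[str]:
        """Feed one check result; return "ALERT", "RECOVERY" or None."""
        if ok:
            self.bad_streak = 0
            if self.alerting:
                self.alerting = False
                return "RECOVERY"
            return None
        self.bad_streak += 1
        if not self.alerting and self.bad_streak >= self.threshold:
            self.alerting = True
            return "ALERT"
        return None


def run_checks(results: List[bool], threshold: int = 3) -> List[Optional[str]]:
    """Replay a list of check results through a fresh damper."""
    damper = FlapDamper(threshold=threshold)
    return [damper.observe(ok) for ok in results]


if __name__ == "__main__":
    # One isolated failure is ignored; four failures in a row raise one alert.
    samples = [True, False, True, False, False, False, False, True]
    print(run_checks(samples))
```

The trade-off is detection delay: with a threshold of three and a one-minute check interval, a genuine outage takes about three minutes to page anyone.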
Graphing of performance statistics, usually collected via SNMP, WMI or a web API, gives engineers insight into trends and capacity issues that may eventually cause problems. A graphing system that allows engineers to accurately analyse the data and predict problems and future trends is invaluable for networks that have the potential to grow. Be aware, though, that some monitoring systems have a minimum data collection interval, and when that interval is too long the averaged figures can paint an inaccurate picture of network utilisation. For example, most systems cannot detect micro-bursting by querying SNMP counters, because a burst lasting milliseconds disappears into an average taken over a polling interval of seconds or minutes.
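The effect of the polling interval is easy to show with some arithmetic. The sketch below computes average utilisation from two readings of an interface octet counter, then compares a one-second view with a five-minute view of the same traffic. The 1 Gbit/s link speed and the traffic pattern are assumptions chosen to keep the numbers simple, and the burst here lasts a full second; a real micro-burst lasts milliseconds and would be even harder to see.

```python
def utilisation_percent(octets_start, octets_end, interval_s, link_bps, counter_bits=64):
    """Average link utilisation (%) between two octet-counter readings."""
    # SNMP counters wrap at 2**counter_bits; the modulo copes with one wrap.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * link_bps) * 100


LINK_BPS = 1_000_000_000        # assumed 1 Gbit/s interface
BURST_OCTETS = LINK_BPS // 8    # one second of traffic at line rate

# Hypothetical counter readings: a one-second burst, then 299 idle seconds.
counter_t0 = 0
counter_t1 = BURST_OCTETS
counter_t300 = BURST_OCTETS

one_second_view = utilisation_percent(counter_t0, counter_t1, 1, LINK_BPS)
five_minute_view = utilisation_percent(counter_t0, counter_t300, 300, LINK_BPS)

if __name__ == "__main__":
    print(f"1 s poll:   {one_second_view:.2f}%")
    print(f"300 s poll: {five_minute_view:.2f}%")
```

The link was saturated for a second, yet the five-minute graph shows roughly a third of one percent.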
Network devices spit out a tonne of useful information in the way of syslogs; anything from hardware error notifications to protocol state changes and environmental warnings can be collected through a well-configured log management system. If the log collection is finely tuned enough, you can link the logging platform to your alerting system and wake someone up when an error is a definite indicator of a problem. Dumping the logs to disk in text files leaves them inaccessible and therefore not useful. Powerful analysis tools can be used to collect, search and graph logs for outage diagnosis and postmortem root cause analysis, which might otherwise be much harder to perform.
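As a rough illustration of linking logs to alerting, the sketch below reads the priority value at the front of a syslog message and decides whether it is severe enough to page someone. Splitting the priority into facility (`PRI // 8`) and severity (`PRI % 8`) is standard syslog behaviour; the function names, the severity cut-off of 3 (error or worse) and the sample messages are my own assumptions.

```python
import re

# A syslog line starts with "<PRI>", where PRI = facility * 8 + severity.
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)
SEVERITY_NAMES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def parse_pri(line):
    """Return (facility, severity, rest_of_message), or None if malformed."""
    match = PRI_RE.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    return pri // 8, pri % 8, match.group(2)


def should_page(line, max_severity=3):
    """True when the message is at `max_severity` or more severe (lower number)."""
    parsed = parse_pri(line)
    return parsed is not None and parsed[1] <= max_severity


if __name__ == "__main__":
    # Hypothetical messages in the style a switch might send.
    samples = [
        "<187>Jan 10 12:00:01 core-sw1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down",
        "<189>Jan 10 12:00:05 core-sw1 %SYS-5-CONFIG_I: Configured from console",
    ]
    for line in samples:
        facility, severity, _ = parse_pri(line)
        print(SEVERITY_NAMES[severity], "-> page" if should_page(line) else "-> log only")
```

In practice severity alone is a blunt filter, so a real deployment would likely also match on message type before waking anyone.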
Last but not least, device management features as part of an NMS ensure that engineering teams stay on top of vendor software updates, bugs and support contracts. Having all this information in one place makes it easy to manage a large number of devices when contracts come up for renewal and new software versions are released. Good device management platforms also keep an accurate inventory of hardware variants, line cards and modules, as well as their serial numbers for warranty purposes.
There will be some product suites that cover all these areas and may suit your needs. However, my best advice is to work out what features you require, create a matrix and assign a priority value to each. Then, once you begin looking at the options on the market, you have an excellent tool to help you make the best choice. Also keep in mind that you may need to choose two or maybe three different systems to cover all four of the categories mentioned above.
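The matrix idea can be sketched in a few lines. The priorities, the 0 to 5 scores and the product names below are entirely made up for illustration; the point is only the mechanics of weighting each feature by its priority and spotting the categories a candidate leaves uncovered.

```python
# Hypothetical priority (weight) for each category, higher means more important.
PRIORITIES = {
    "alerting": 5,
    "graphing": 3,
    "log management": 3,
    "device management": 2,
}

# Hypothetical scores from 0 (absent) to 5 (excellent) for placeholder products.
CANDIDATES = {
    "Product A": {"alerting": 4, "graphing": 2, "log management": 0, "device management": 1},
    "Product B": {"alerting": 2, "graphing": 4, "log management": 3, "device management": 3},
}


def weighted_score(scores, priorities):
    """Sum of priority * score over every feature in the matrix."""
    return sum(weight * scores.get(feature, 0) for feature, weight in priorities.items())


def rank(candidates, priorities):
    """Candidates as (score, name) pairs, best first."""
    scored = [(weighted_score(scores, priorities), name) for name, scores in candidates.items()]
    return sorted(scored, reverse=True)


def uncovered(scores, priorities):
    """Features a candidate does not cover at all, i.e. gaps for a second system."""
    return [feature for feature in priorities if scores.get(feature, 0) == 0]


if __name__ == "__main__":
    for score, name in rank(CANDIDATES, PRIORITIES):
        print(name, score, "gaps:", uncovered(CANDIDATES[name], PRIORITIES))
```

A candidate with a strong score but a gap, like the log management hole in the first placeholder product, is exactly the case where pairing it with a second system makes sense.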
For a good place to start, here is my list of a few of the better options on the market, covering both open source and paid variants.
- ELK (ElasticSearch / LogStash / Kibana)
- Nagios and its variants (OpsView / Icinga / Check_MK)
- ManageEngine OpManager
- Vendor-based NMS (JunOS Space / Cisco Prime)