Tim Raphael
The musings of a network/software engineer
16 September, 2017

InfluxDB Integrated with Prometheus

Tim Network Monitoring 0 Comments

For a while I’ve been keeping an eye on Prometheus as I’ve been fascinated with its architecture and philosophy. The one small detail that has prevented me from pursuing it as a production tool (for network monitoring specifically) is the limitation around long-term storage of metrics. Prometheus is designed to keep a certain period of data and then expire it from its local storage. I’ve been watching this GitHub issue for quite a while, as it captures the need for Prometheus to be able to forward metrics to other time series databases for long-term storage.

Paul Dix presented a talk at PromCon 2017 (blog post at InfluxData) announcing that InfluxData has built on the experimental remote read/write example code and added full support for InfluxDB as a remote storage backend. Metrics from Prometheus can be forwarded to InfluxDB while also being stored locally for fast analysis over a defined window (e.g. 90 days), and PromQL can query data from both sources at the Prometheus layer.
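
The nice part of this integration is that queries remain plain PromQL regardless of where the samples live. Below is a minimal sketch (mine, not from the talk) of a range query against the Prometheus HTTP API using Python; the Prometheus URL, metric name and time range are assumptions for illustration. With remote read pointed at InfluxDB, samples older than the local retention window are fetched from InfluxDB and merged into the result transparently.

    import requests

    PROMETHEUS = "http://localhost:9090"   # assumed local Prometheus instance

    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={
            "query": "rate(ifHCInOctets[5m]) * 8",   # hypothetical metric from an SNMP exporter
            "start": "2017-06-01T00:00:00Z",          # older than local retention,
            "end": "2017-09-16T00:00:00Z",            # so it comes back via remote read
            "step": "1h",
        },
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(series["metric"], len(series["values"]), "samples")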

This is something I’d love to investigate further, as I really love the community around Prometheus as well as the project’s design philosophy.

Links:

  • https://prometheus.io/
  • https://promcon.io/2017-munich/
  • https://www.influxdata.com/blog/
  • https://www.youtube.com/playlist?list=PLoz-W_CUquUlnvoEBbqChb7A0ZEZsWSXt

 

14 September, 2017

Modern network monitoring for the rest of us

Tim Network Monitoring 0 Comments

I recently gave a talk at AusNOG 2017 and, given the presentation wasn’t recorded, I wanted to make sure it was documented. I had some very valuable feedback and insight from attendees following the session and I’ve attempted to address some of those points in this more detailed post. The post roughly follows the accompanying slides, with a few edits.

Intro

I’ve spent a very large amount of time over the past few years thinking about, building proofs of concept for and playing with various monitoring technologies. This post is a collection of the thoughts and ideas I’ve gathered along the way – hopefully they’re useful to you, and I’m more than happy to hear any thoughts and feedback you might have.

Why do we need monitoring?

This should be a simple question for most of us to answer but I want to outline some of the more high-level reasons so we can address them later on.

React to Failure Scenarios

The worst possible scenario for a network operator is to have a critical component fail and not know – an unknown unknown. For us as operators to deliver reliable services, it’s critical we have complete insight into every possible dependent component in our network that delivers our services.

Capacity Planning

To scale and build our networks, platforms and applications appropriately, we need enough information to predict short- and long-term trends.

New Trends

With the right data and approach to analysis, many emergent behaviours can be derived to give new insight into how our customers are using our applications and networks.

Drive Continual Change

Network monitoring should play a key role in reacting to unexpected outages and ensuring they don’t recur. As business goals and priorities change, monitoring should be an important tool to ensure that applications and networks meet their goals and customer expectations.

Current State

Small and medium ISPs and enterprises generally tend towards out-of-the-box or entry-level commercial software packages that promise visibility into a wide range of applications, devices and operating systems (PRTG, Zabbix, SolarWinds etc.). This is generally driven by limited time and resources, so a package offering auto-discovery, “magic” configuration and low overhead is the path of least resistance. Unfortunately, these packages tend to suffer from a lack of focus when it comes to matching visibility with organisational goals. By trying to target many customer, software and hardware types, they generate noise and false positives through too many automatic graphs and alerts that aren’t focused on ensuring higher-level goals are met.

Meanwhile, the internet giants like Facebook, Google and Netflix are building and open sourcing their own tools, driven by the need to collect, analyse and visualise data of unprecedented scale and complexity. A perfect example is the FBTracert utility, designed to identify network loss across complex ECMP fabrics. The catch is that this approach requires software development skill and resources, which these companies already have in-house and can leverage. The benefit is tools and utilities custom built to tell the organisation about the quality of the product being delivered.

Limitations on the rest of us

Time and Resources

For obvious reasons, time and resources (be that money or people) are limited in small to medium ISPs and enterprises, which often means only the minimum time and effort is put towards monitoring. Sometimes the minimum is enough; other times, more effort yields greater reward for the business and its operators.

Network Device Compute Power

Until recently, operators have had limited on-box processing power, which has constrained the volume and resolution of metrics collected. This has limited the features vendors could implement (NetFlow vs. sFlow, for example) and the rate at which operators could poll metrics, due to the processing cost on-box. Only in the last few years have we started to see devices with processing power to spare and the functions that let operators use it.

Vendor Feature Support

We have recently seen vendors realise that visibility and insight is a huge gap in their network devices. Features such as streaming telemetry and detailed flow-based tracking are slowly becoming available, but only on flagship products ($$$$$). New software always means bugs, and it’s my opinion that certain vendors prefer to use their customer base as bug reporters rather than doing the testing that’s required. Vendors also tend to implement features for the market majority, of which smaller ISPs are usually not a part.

Previous Generation Devices

Unfortunately, not all of us have the budget to rip and replace our networks every 2-3 years to keep up with the major vendors’ product lifecycles. We all have previous (and N-2/3/4) generation hardware in active, production use. This limits the features we can deploy uniformly across our networks – which unfortunately means plenty of SNMP and expect scripts.

Top Down Approach

I’ve spent a lot of time thinking about how to go about network monitoring in an efficient way. This is a method that I’ve implemented in a few different types of networks (ISPs, enterprises, live events) and I feel it’s a very focused way of gaining insight from network metrics.

The key idea to keep in mind when building a network monitoring system is to focus on the business goals – what does my business / event / network aim to achieve? For most ISPs, this will be a measure of quality or performance for their customers. It’s up to you as the system designer and business analyst to figure out what constitutes a good experience and design your monitoring to detect when that goal isn’t being met.

Metrics

Due to the limitations mentioned above, it’s not possible (in most cases) to collect every metric from every device, so a subset must be carefully selected. Start with your business goal and decide which metrics are required to accurately measure the state of that goal. For example, say an ISP wishes to ensure that its subscribers get the best performance possible during peak usage – this can be measured through the oversubscription ratio and uplink utilisation of the terminating device. The required metrics are: number of connected sessions, plus ifHCInOctets, ifHCOutOctets and ifHighSpeed of the uplink from the terminating device. Assuming a known average subscriber speed, we can calculate an oversubscription ratio and uplink utilisation (as a percentage, to normalise for varying uplink speeds). Repeat this process across all aspects of the network that might affect business goals to build a list of metrics and measurements.
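
To make the arithmetic concrete, here is a rough sketch of that calculation (the plan speed, uplink speed and counter values are invented for illustration, and real code would also need to handle counter wraps):

    def oversubscription_ratio(sessions: int, avg_plan_mbps: float, uplink_mbps: float) -> float:
        """Sold capacity versus physical uplink capacity."""
        return (sessions * avg_plan_mbps) / uplink_mbps

    def utilisation_pct(octets_prev: int, octets_now: int, interval_s: float, uplink_mbps: float) -> float:
        """Percentage utilisation from two ifHCInOctets/ifHCOutOctets samples."""
        bits_per_sec = (octets_now - octets_prev) * 8 / interval_s
        return 100.0 * bits_per_sec / (uplink_mbps * 1_000_000)

    # 800 connected sessions on a 25Mbps average plan, terminating on a 10Gbps uplink:
    print(oversubscription_ratio(800, 25, 10_000))          # 2.0 (i.e. 2:1)
    # 75GB moved across the uplink during a 5-minute polling interval:
    print(utilisation_pct(0, 75_000_000_000, 300, 10_000))  # 20.0 (%)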

Alerting

With focused measurements in place, business rules can be applied to create low-noise, useful and actionable alerts. There is nothing worse as an operations engineer than being woken in the middle of the night for something that can wait until the next day. Build a policy that defines exactly how to classify alert priorities, then map an appropriate alerting method to each priority. For example, disk usage crossing 80% doesn’t always justify an SMS and can easily be dealt with via a ticket or email to the appropriate team. Avoid too much noise from high-priority alerts and save instant communication methods for potentially service-impacting conditions requiring immediate action.
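
As a trivial sketch of that mapping (the priorities, methods and example alerts below are invented; the point is that the policy is explicit and lives in one place):

    POLICY = {
        "P1": "sms",      # potentially service impacting - wake someone up
        "P2": "email",    # degraded but can wait for business hours
        "P3": "ticket",   # housekeeping, e.g. disk usage crossing 80%
    }

    def route_alert(name: str, priority: str) -> str:
        method = POLICY.get(priority, "ticket")   # default to the least intrusive method
        return f"{name} -> {method}"

    print(route_alert("uplink utilisation > 90% for 15 min", "P1"))   # ... -> sms
    print(route_alert("disk usage > 80%", "P3"))                      # ... -> ticket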

Continual Improvement

An effective monitoring system should be a key part of continual improvement in your organisation. Whenever an unexpected outage or event occurs, go through the usual root cause investigation, then take the few extra steps to isolate the symptoms that indicated the issue and define appropriately prioritised alerts to prevent a recurrence.

Slide 12 – Explaining where monitoring belongs in the continual improvement process.

Monitoring for the rest of us

I wanted to keep this post at a high level and dive into the “how” in future posts, but to give some guidance on going about this process of effective monitoring, I have a few recommendations.

Use the technology available to you

Most organisations will already have some sort of monitoring application in place (be it commercial or otherwise) and most platforms have enough knobs and dials to be coerced into behaving as I’ve described above. Take a step back, do the high level analysis of your business and network goals and then think about how you can use what you already have to structure an effective monitoring solution.

Design for Resilience

As an operations engineer, I am well aware of the “flying blind” feeling when monitoring stops working – not a comfortable position to be in. Design your monitoring system to be as resilient as possible against network and underlying system outages. Even if you can’t get to monitoring data during an outage, ensure that metrics are preserved for later root cause analysis.

Slide 15 – Demonstrating a decentralised collector architecture.

This can be achieved by decentralising the collection mechanisms and placing collectors as close as possible to the data sources – this reduces the number of dependent systems and devices that could affect the metric collection process. Storing information centrally is key to effective analysis, so a means of bringing decentralised collector data into a central database is important. Queues and buffers are an excellent way to ensure persistence if the connection between a collector and the central store is disrupted – many software platforms support Apache ActiveMQ, Kafka or other internal buffering mechanisms.
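
As a sketch of what a collector’s buffered hand-off might look like, here are a few lines using the kafka-python client (one option among many; the broker address, topic name and metric shape are all assumptions):

    import json
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.example.net:9092",
        value_serializer=lambda m: json.dumps(m).encode("utf-8"),
        retries=5,                      # keep retrying rather than silently dropping samples
    )

    def publish(metric: str, value: float, tags: dict) -> None:
        # The collector keeps gathering locally; the central store consumes from the
        # topic whenever connectivity allows, so a WAN blip doesn't lose data points.
        producer.send("metrics", {
            "metric": metric,
            "value": value,
            "tags": tags,
            "timestamp": time.time(),
        })

    publish("ifHCInOctets", 9_750_000_000, {"device": "agg1.example.net", "ifIndex": 12})
    producer.flush()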

As mentioned previously, limitations of current software and hardware prevent operators from collecting every single statistic possible; however, if you can store more than the minimum, do so for future analysis. Ensure that any raw metrics used for further calculations are also stored, as this can be key when doing historical analysis on new trends.

Meta-monitoring is important – make sure your monitoring system is operating as expected by checking that data is current, services are in the correct state and network connections exist on the expected ports. There is nothing worse than having a quiet day in operations only to realise that monitoring hasn’t been working correctly.
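
A tiny sketch of that kind of meta-check, here asking InfluxDB’s 1.x /query endpoint how old the newest sample of a key measurement is (the host, database and measurement names are assumptions):

    import time
    import requests

    resp = requests.get(
        "http://influxdb.example.net:8086/query",
        params={
            "db": "telemetry",
            "q": 'SELECT last("value") FROM "ifHCInOctets"',
            "epoch": "s",               # return timestamps as epoch seconds
        },
    )
    resp.raise_for_status()
    newest = resp.json()["results"][0]["series"][0]["values"][0][0]

    age = time.time() - newest
    if age > 600:                        # nothing written for 10 minutes
        print(f"STALE: newest ifHCInOctets sample is {age:.0f}s old - check the collectors!")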

Future

I am really passionate about monitoring and I intend to go into further technical detail with reference to specific technologies in future posts. I presented this talk to help me construct and distill my ideas into something cohesive and I feel that I’ve learnt a lot about my own processes as I went.

I’m lucky in that my current employer allows me a lot of freedom to be innovative and build an effective monitoring system to drive continual improvement in the organisation. Anything specific to this project will be published on our blog – the first post is here: Metrics @ IX Australia.

9 February, 2017

New Job, New Location, New Goals

Tim Life Career, Job, Sydney 2 Comments

Around the beginning of last year I left my awesome position at Zettagrid with the goal of completing my Master of Software Engineering degree at UWA. I worked part time at UWA and managed to complete what was a pretty tough year academically. At the end of last year I made the decision to move to Sydney in the new year. My girlfriend is starting a very prestigious PhD program in 2017, and together we’re in pursuit of what comes next.

I have taken a new position with the Internet Association of Australia (IAA) as a Peering Engineer, operating IX Australia, the largest DC- and ISP-neutral peering fabric in Australia. I will be based over east with a mandate to operate and grow the fabric and to engage with the IAA membership to improve the range and quality of services offered.

As of this morning, all our belongings have been collected by removalists and come Saturday, we start the 5-day drive from Perth to Sydney. We thought this would be a brilliant opportunity to see Australia and move the car in one go. Hopefully in the new year, my job will allow me more R&D scope and I plan to be blogging much more on the technologies I encounter and experiments I conduct.

 

All our belongings fit into 5 cubic metres! Time for new beginnings!
19 December, 2016

Appearance on The Packet Pushers Podcast

Tim Uncategorized 2 Comments

I had the honour of appearing as a guest on The Packet Pushers Podcast for their Modern Networking series. We recorded an episode with a couple of other guests, both from higher education / campus network backgrounds. We discussed everything from campus network design, monitoring, automation and even the involvement of enterprises in the IETF.

It was a great discussion that raised many current issues around industry mindset as well as current areas of progress. I particularly enjoyed the discussion around monitoring, metrics and streaming telemetry. I believe this is a key area within which the industry needs to make major progress to advance our state of the art.

While recording the episode, we ended up talking for over two hours resulting in a two-part series. Listen to Part 1 now with Part 2 set for publication in a week’s time. The Packet Pushers run a really tight ship and produce some awesome, quality content. Follow their community blog and subscribe to their podcasts on iTunes or Google Play Music.

27 July, 2016

Network Monitoring @ RFLAN 54

Tim Network Monitoring, RFLAN, Switching 2 Comments

This blog post was originally published on the RFLAN – Redflag Lanfest Facebook page and has been re-syndicated here with only opening remarks added.

About a year ago I joined the Infrastructure Team at RFLAN to help run what we believe is the largest LAN party in Australia. This brought many technical and logistical challenges that I have really enjoyed. Below is an overview of the monitoring infrastructure the team and I created for the last RFLAN event.


After laying down many kilometres of cable and configuring services and monitoring infrastructure, RFLAN 54 was set to go. Based on valuable feedback from the Lanners in the post-event survey, we identified that we needed better monitoring of, and insight into, the event network. We believed that broadcast storms caused by layer 2 switching loops were the cause of poor network performance and set about creating detailed monitoring to home in on the root cause.

The team chose CollectD to poll SNMP values from network devices (core and table switches) and to query other useful information from the wireless controller, the DHCP and DNS servers, and the game servers. CollectD pushes metrics into a time series database called InfluxDB, from which beautiful dashboards are built in Grafana.

Flow chart showing the collection of monitoring tools and how they are linked together.
Here is the flow of data as it moves through our monitoring stack consisting of CollectD, InfluxDB and Grafana.
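
For anyone wanting to replicate the stack above, the glue is pleasantly simple. Here is a minimal sketch (not the actual event tooling) of how a single point lands in InfluxDB over its 1.x HTTP write API using the line protocol, which is roughly what the CollectD-to-InfluxDB output path does on our behalf. The host, database, tags and value are assumptions:

    import time
    import requests

    point = (
        "if_broadcast_rx,switch=table-sw-23,port=Gi0/48 "   # measurement + tags
        "value=18500 "                                       # field
        f"{int(time.time() * 1e9)}"                          # nanosecond timestamp
    )

    resp = requests.post(
        "http://influxdb.rflan.local:8086/write",
        params={"db": "rflan"},
        data=point,
    )
    resp.raise_for_status()   # InfluxDB answers 204 No Content on success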

The key lesson we learnt from previous events was to know in advance which problems we want to avoid and to collect the metrics that aid diagnosis when they do occur. RFLAN runs a large, single layer 2 broadcast domain with up to 800 hosts sharing the same VLAN; this is due to the nature of game server auto-discovery. This configuration can cause widespread disruption when layer 2 loops occur, such as when Lanners bring their own network devices that have not been configured correctly. These devices are also not visible to the network admins, which makes diagnosing issues take much longer. To identify the source of network loops, we collected all broadcast, unicast and interface counters and crafted dashboards that would show us exactly when and where a loop was occurring. Unfortunately, due to limitations of the desk switches we use, we can only narrow the issue down to a single table switch and not to a specific set of ports.
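
The loop-hunting logic itself boils down to turning successive broadcast counter samples into a rate per uplink and flagging anything that looks like a storm. A rough sketch (the switch names, counter values and threshold are invented):

    STORM_THRESHOLD_PPS = 10_000

    def broadcast_pps(prev: int, curr: int, interval_s: float) -> float:
        return max(curr - prev, 0) / interval_s     # crude guard against counter resets

    # (previous, current) ifHCInBroadcastPkts samples taken 30 seconds apart
    samples = {
        "table-sw-07": (1_200_000, 1_201_500),
        "table-sw-23": (4_800_000, 5_400_000),
    }

    for switch, (prev, curr) in samples.items():
        pps = broadcast_pps(prev, curr, 30)
        if pps > STORM_THRESHOLD_PPS:
            print(f"{switch}: {pps:,.0f} broadcast pkts/sec - probable loop behind this switch")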

During the event, the Network Operations Centre (NOC) was notified of Lanners receiving multiple disconnects from gaming sessions, a classic sign of a network disruption. Upon inspection of the graphs, a single table switch was identified as flooding in excess of 20,000 packets/sec of broadcast traffic, which was then replicated to all other table switches, creating a broadcast storm. After shutting down the identified table switch, admins slowly began bringing each port back online until the looping Lanner was identified. Upon removal of a foreign device connected to a table switch port, the network returned to normal and operated as expected. Below is the dashboard we created that highlights exactly where the broadcast storm originated, based on the number of broadcast packets received from table switches.

A graph showing the link bundles with their received broadcasts statistics.
All the link bundles showing their received number of broadcast packets per second.
A graph showing a large peak on a single link bundle indicating a high rate of received broadcasts.
The identified bundle narrowed down to the time of the network disruptions.

Below are a few of the awesome dashboards the admins created before and during the event to provide heads-up information across all the critical services of the event.

Internet Bandwidth

A graph showing internet bandwidth across the 24 hours of the event.
Internet connectivity during the event. We topped out at 6.4Gbit/sec just after midday on the Saturday.

DNS Infrastructure

A Grafana Dashboard with DNS statistics.
Using an application called DNSDist that allows us to load-balance upstream servers and pull all of these useful statistics.

Wireless Infrastructure

A Grafana dashboard showing wireless statistics and gauges.
Pulling statistics from the Ubiquiti WiFi controller with custom-built scripts.

DHCP Leases

A smooth curved graph showing active DHCP leases during the event.
Count of DHCP leases active during the event.

NOC Dashboard

A dashboard showing an overall view of all event systems.
This was the main dashboard the senior network admins used to monitor the overall health of the network. This was run on a 4K TV driven by an Intel NUC SFF PC.
13 April, 2015

Thoughts on the Network Services Header IETF Draft

Tim NFV, SDN IETF, Network Services Header, NFV, NSH, SDN 0 Comments

The IETF currently has a draft specification with the Network Working Group that defines a new standard, the Network Services Header (NSH), describing how network service chains are controlled through a network.

The NSH concept aims to provide a means of constructing service chains, allowing network administrators to define paths through the network and use policy to ensure that classes of traffic are treated in a certain way. NSH falls into the arena of Network Function Virtualisation (NFV), where services on the network (firewalls, load balancers, DDoS scrubbers etc.) can be dynamically connected together to form service chains. NSH aims to decouple a service chain from the topology supporting it by inserting a Network Services Header between the outer transport header and the original packet. NSH-aware devices can act on the Network Services Header information, while NSH-unaware devices simply forward the packet based on the outer transport header alone.

After an enlightening conversation with Greg Ferro (@etherealmind) over Twitter, we highlighted a few issues we see with this proposed standard in relation to its foundational construction and how it interfaces with existing SDN and NFV concepts.

1990s Solution to a Modern Day Problem

In the early days of networking it was perfectly acceptable to solve a new problem by adding another protocol to the stack, be it an encapsulation mechanism or a control plane protocol. With SDN gaining popularity, people aren’t interested in new data plane or control plane protocols that rely on per-device implementations, as these will most likely prevent adoption. The advent of controller-based networking means that out-of-band control mechanisms such as OpenFlow, OpenDaylight and the like are a much more scalable means of dictating policy on a network than another layer of encapsulation.

Semi-Distributed Nature

The Network Services Header draft mentions some form of control plane protocol that falls outside the draft specification, making the proposed standard a semi-distributed protocol. When attempting to dictate policy within a network, I personally feel that a fully centralised model means less overhead for network administrators and enables easier adoption of the concept. For policy to be enforced, the network does need a way to identify and classify traffic flows – but I don’t think it is necessary to add an additional encapsulation when control protocols such as OpenFlow already provide a means of traffic classification driven by a centralised controller.

Existing Alternatives

In my opinion, the easiest way to dictate network policy is to use a metadata-based system that utilises existing protocols such as OpenFlow, MPLS, NVGRE and VXLAN* to control and identify network flows and enforce policy through their existing mechanisms. For example, MPLS can transparently push traffic at L2/L3 from one point in the network to the next with a unique label set for identification. There is no need for multiple context headers per flow when the label can identify the required policy via a centralised controller.

*VXLAN is lacking in regard to some meta-data features but I’m assured change is coming.

 

11 April, 2015

Musings on Network Monitoring

Tim Network Monitoring Alerting, Graphing, Logging, Monitoring, NMS 0 Comments

I’ve recently been spending some time looking into the world of network monitoring. Many of my readings mention that a strong network monitoring strategy is key to staying on top of your infrastructure. I’ve done a lot of reading, investigating and testing of the various monitoring platforms out there and I’ve come to the conclusion that no single application or suite meets my needs. I did, however, find that network monitoring falls into four rough categories:

Alerting
Priority one on my list of features for an NMS is the ability to detect a change in device state or trigger at some threshold. Alerting allows an organisation to run a 24/7 operation without paying for a round-the-clock engineering team. A poorly configured alerting system, on the other hand, can trigger false positives that may hide actual underlying problems. A good NMS should allow significant control over how and when alerts are raised, but shouldn’t be so complex that it takes an inordinate amount of time to configure correctly.

Graphing
Graphing of performance statistics, usually collected via SNMP, WMI or a web API, gives engineers insight into trends and capacity issues that may eventually cause problems. A graphing system that allows engineers to accurately analyse the data and predict problems and future trends is invaluable for networks with the potential to grow. Be aware, though, that some monitoring systems have minimum data collection intervals that can paint an inaccurate picture of network utilisation when the interval is too long. For example, most systems cannot detect micro-bursting by querying SNMP counters.

Logging
Network devices spit out a tonne of useful information in the form of syslog; everything from hardware error notifications and protocol state changes to environmental warnings can be collected through a well-configured log management system. If log collection is tuned finely enough, you can link the logging platform to your alerting system and wake someone up when an error is definitely an indicator of problems. Dumping the logs to disk in text files leaves them inaccessible and therefore not useful. Powerful analysis tools can collect, search and graph logs for outage diagnosis and post-mortem root cause analysis that would otherwise be much harder to perform.
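
As a toy sketch of the “wake someone up on the right log lines” idea (a real deployment would sit behind a proper log pipeline; the port and the escalation step here are placeholders):

    import re
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 5514))                 # unprivileged syslog port for the example

    while True:
        data, addr = sock.recvfrom(4096)
        match = re.match(rb"<(\d+)>", data)      # syslog messages start with a <PRI> field
        if not match:
            continue
        severity = int(match.group(1)) % 8       # severity = PRI mod 8 (0 = emergency .. 7 = debug)
        if severity <= 3:                        # emergency, alert, critical, error
            print(f"ESCALATE from {addr[0]}: {data[:120]!r}")   # hand off to the alerting system here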

Device Management
Last but not least, device management features as part of an NMS ensure that engineering teams stay on top of vendor software updates, bugs and support contract agreements. Having all this information in one place makes managing a large number of devices easy when contracts come up for renewal and new software versions are released. Good device management platforms also keep an accurate inventory of the various hardware variations, line cards and modules, along with their serial numbers, for warranty purposes.

 

There will be some product suites that cover all of these areas and may suit your needs; however, my best advice is to work out which features you require, create a matrix and assign a priority value to each. Then, once you begin looking at the options on the market, you have an excellent tool to help you make the best choice. Also keep in mind that two or maybe three different systems may need to be chosen to cover the four categories mentioned above.

For a good place to start, here is my list of a few of the better options on the market, covering both open source and paid products.

  • PRTG
  • ELK (ElasticSearch / LogStash / Kibana)
  • Nagios and its variants (OpsView / Icinga / Check_MK)
  • Sensu
  • Graphite
  • Zabbix
  • Zenoss
  • SolarWinds
  • ManageEngine OpManager
  • Vendor-based NMS (Junos Space / Cisco Prime)
19 February, 2015

Juniper ACX 500 and 5000 Models Announced

Tim Juniper, Routing ACX, Juniper, MPLS, NFV, Routing, SDN 0 Comments

Today Juniper announced the expansion of their ACX range of Universal Access Routers with the addition of the ACX500 and ACX5000 variants. The ACX range boasts features allowing enterprises and service providers to deliver routing, MPLS and Metro-Ethernet services to the edge of their networks in a compact form factor.

ACX500

The ACX500 model is hardened for indoor or outdoor use in rugged deployment scenarios and is equipped with a fanless, AC/DC-powered chassis. Port configurations come in a range of mixed copper and fibre 1Gbit ports. Overall throughput tops out at 6Gbit/sec, making these perfect for remote wireless or cell tower deployments that need L2/L3 MPLS functionality. The ACX500 is also fitted with a GPS receiver for superior clocking when deployed as part of a mobile backhaul network and to support location-based mobile services.

ACX5000

The ACX5000 is a full-featured, 1 or 2RU platform supporting a full range of MPLS, Metro-Ethernet and SDN features. The 5000 series comes in two model variants, the ACX5048 and ACX5096, supporting 48 or 96 1Gbit (SFP) or 10Gbit (SFP+) ports respectively, with 6 or 8 QSFP ports providing 40Gbit connectivity. Additionally, Junos runs on each device within a KVM-based hypervisor, allowing for seamless OS upgrades (ISSU) by running two instances of Junos and switching between them as part of the upgrade, similar to the QFX5100 line of switches. Other features include Virtual Chassis and MC-LAG, much like Juniper’s MX Metro-Ethernet routers.

The whole ACX family supports integration with Juniper’s automation and management platform, Junos Space, to unify configuration and image deployment and to provide a single point of control for network automation.

All in all, the new additions to the ACX family are well placed to give enterprises and service providers even more options when building out next-generation networks supporting dynamic and virtualised network loads.

13 February, 2015

Investigating Micro-Bursting

Tim Switching, Wireshark 0 Comments

As modern data networks get faster and faster, we’re starting to see Internet traffic demands grow at an alarming rate. This phenomenon has put a lot of pressure on network engineers to monitor their traffic levels and scale capacity accordingly. We tend to rely on our capacity monitoring tools to tell us exactly how utilised our network is and when we should expect to grow – which is all well and good unless our tools don’t paint an entirely accurate picture.

Most SNMP monitoring tools poll the input and output octet counters of a device at a pre-defined interval and then calculate an average over that period. For example, a device that switches 120MB in 60 seconds shows an average data rate of 16Mbps. This average doesn’t account for the fact that the interface might have done 100MB in the first 10 seconds (80Mbps) and the other 20MB in the remaining 40 seconds (4Mbps). You can see from this example that an average of 16Mbps isn’t even close to an accurate picture of interface utilisation.
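
The arithmetic from that example, spelled out (figures as in the text):

    def mbps(megabytes: float, seconds: float) -> float:
        return megabytes * 8 / seconds

    print(mbps(120, 60))   # 16.0 -> the 60-second average the NMS reports
    print(mbps(100, 10))   # 80.0 -> what actually happened in the first 10 seconds
    print(mbps(20, 40))    # 4.0  -> ...and in the remaining 40 seconds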

A micro-burst is simply a very short period during which enough traffic moves through an interface to hit its upper limit, causing packets to be dropped. The period is often too short to be detected by the usual array of network monitoring tools.

Recently my monitoring tools highlighted an uplink trunk interface with an abnormal number of interface discards, and only after a lot of reading and troubleshooting did I start to wonder whether micro-bursting might be the culprit.

A graph of 30-second averages on the gigabit interface, showing discards during peak times.

To find out whether micro-bursting was to blame, and what was causing it, I broke out a high-powered packet capture server loaded with Ubuntu, tcpdump and Wireshark. The uplink trunk in question was running (according to my monitoring tools) at an average of 200-300Mbps in each direction during normal business hours, so I needed CPU and disk I/O that could keep up with a minimum of ~60MB/sec and a potential maximum of 250MB/sec (1Gbps each way) for many minutes to try and capture a micro-burst in action.

Several 4GB+ packet captures later, taken from a SPAN (mirror) port with tcpdump, I was able to load the results into Wireshark for analysis. As each packet is captured from the SPAN port it is time-stamped very accurately, which lets us find periods of high utilisation. After waiting a while (at least two coffees) for the capture file to load in Wireshark, select “I/O Graph” from the Statistics menu to generate a graph of packets, bytes or bits per second based on the timestamps accompanying the packets.

Adjust the Tick Interval and let Wireshark recalculate the graph down to that time interval.

Wireshark allows you to change the period of each graph interval down to 1/10th, 1/100th or even 1/1000th of a second to see whether you are experiencing micro-bursts. In my case, I was able to find short periods of time where traffic was saturating the 1Gbps uplink.

A graph generated by Wireshark showing micro-bursting activity.
The above graph shows the interface hitting 1Gbps and flat-lining for about 2/10ths of a second, a micro-burst in action.

Using the x-axis of the graph as a time indicator, I was able to look through the contents of the capture and determine that my caching farm was actually pushing out 1Gbps in response to some client requests, causing the interface drops.
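
The same sub-second breakdown Wireshark produces can also be scripted over the raw capture, which is handy for multi-gigabyte files. Here is a sketch using the dpkt library (one option of several; the capture file name, bucket size and threshold are assumptions):

    from collections import defaultdict
    import dpkt

    BUCKET = 0.1                                  # seconds, i.e. 1/10th-second resolution
    bytes_per_bucket = defaultdict(int)

    with open("uplink-span.pcap", "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):       # each packet comes with its capture timestamp
            bytes_per_bucket[int(ts / BUCKET)] += len(buf)

    for bucket in sorted(bytes_per_bucket):
        rate_mbps = bytes_per_bucket[bucket] * 8 / BUCKET / 1_000_000
        if rate_mbps > 900:                       # sustained near line rate on a 1Gbps link
            print(f"{bucket * BUCKET:.1f}s: {rate_mbps:.0f} Mbps - possible micro-burst")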

Lesson learnt: use your monitoring tools to gauge network load and predict capacity but don’t take them as gospel when long averages are involved.
