The holy grail of monitoring tools is the ability to tell you what is wrong with your network and, in addition, how to fix it. Even better, the tool would fix the network by itself and bring it back to a normal state. But we are not quite there yet…
Tools today do a great job at collecting the data (SNMP, flow, end-user), a decent job at displaying the data, and this is more or less where everything stops. Today, we still rely on humans to troubleshoot and triage a problem by using information from tools and serendipitous engineering intuition and experience.
Below is how a system would look if it were able to detect network and application issues, understand the root cause, and eventually bring the network back to normal conditions.
In control-system theory, this is called a closed-loop control algorithm. This is how our home temperature is kept at the desired level: we (the users) set the desired temperature value, the thermostat monitors the temperature, and the control system gives commands to the A/C or furnace to turn on and off until the desired temperature is reached. This process is repeated every few seconds or minutes.
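The thermostat loop described above can be sketched in a few lines of Python. This is only a toy illustration: the deadband, the heating and cooling rates, and the function names `thermostat_step` and `simulate` are all assumptions made for the example, not a model of a real thermostat.

```python
def thermostat_step(measured, setpoint, heating_on, deadband=0.5):
    """Return the new furnace state (True = on) for one control cycle."""
    if measured < setpoint - deadband:
        return True           # too cold: turn the furnace on
    if measured > setpoint + deadband:
        return False          # too warm: turn the furnace off
    return heating_on         # inside the deadband: keep the current state


def simulate(setpoint=21.0, start_temp=17.0, cycles=60):
    """Run the loop against a toy room model and return the temperature trace."""
    temp, heating_on, trace = start_temp, False, []
    for _ in range(cycles):
        # Monitor, compare with the desired value, command the actuator.
        heating_on = thermostat_step(temp, setpoint, heating_on)
        # Toy plant (assumed numbers): the furnace adds heat, the room leaks a little.
        temp += (0.4 if heating_on else 0.0) - 0.1
        trace.append(round(temp, 2))
    return trace


if __name__ == "__main__":
    print(simulate()[-10:])
```

The loop never reaches the setpoint and stops; it keeps measuring and correcting, which is exactly what makes it closed-loop.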
Let’s break this down from the network perspective:
Assume that we need to configure our network so that the end-to-end latency experienced by all users is less than 4 ms. The requirement could be expressed with any metric, such as latency, bandwidth, or MOS.
Network management tools help you set up and reconfigure your network through a management console or an API. Management tools can be vendor agnostic and they can abstract the dull and nitty-gritty details of device management and configuration. They are the interface between the devices on your network and the configuration you want to apply.
The network is the system under control, which delivers the service we need (in this case, latency < 4 ms). To configure the network to deliver this service, it needs to be observable and controllable. In other words, we need to be able to monitor and measure whether or not it delivers what we are looking for (latency < 4 ms), and we need to be able to configure the network devices to deliver this performance. To put this into perspective, if part of the network is provided by an ISP, then that part is neither observable nor controllable for us.
Network monitoring enables us to know whether the network delivers the required performance (latency < 4 ms). If we can’t monitor and measure the network, we can’t fix or improve it. Network monitoring comes in several types, such as SNMP, flow, and end-user. In this use case, where we need to measure end-to-end latency, we would need end-user synthetic monitoring to measure the actual metric that we are trying to satisfy, while SNMP and flow data would help us capture the complete picture of the network.
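As a rough sketch of the monitoring side, the check below takes per-user end-to-end latency samples and reports which users break the 4 ms target. The data layout (a dictionary of user name to samples in milliseconds) and the function name `sla_violations` are assumptions for illustration; a real synthetic monitoring tool would expose its measurements in its own format.

```python
SLA_LATENCY_MS = 4.0  # target from the example in the text


def sla_violations(samples_by_user, threshold_ms=SLA_LATENCY_MS):
    """Return {user: worst latency} for every user at or above the threshold."""
    violations = {}
    for user, samples in samples_by_user.items():
        worst = max(samples)  # "all users under 4 ms" means the worst case counts
        if worst >= threshold_ms:
            violations[user] = worst
    return violations


if __name__ == "__main__":
    # Hypothetical measurements, in milliseconds.
    samples = {"branch-a": [1.2, 3.9], "branch-b": [2.0, 4.7]}
    print(sla_violations(samples))
```

Using the worst sample is a deliberately strict reading of the requirement; a percentile would be a reasonable alternative if occasional spikes are acceptable.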
In a human-driven world (like the one in which we unfortunately still live), we need to follow these steps to achieve latency less than 4 ms:
Today, we have the pieces of this puzzle (monitoring, management, network), but we haven’t been able to make them work together in an automated fashion that doesn’t require human intervention, other than setting the desired latency value to 4 ms.
The main missing piece is compiling the monitoring data to understand what the root cause of the problem is and what action should be taken to fix it. Today, this part is carried out by humans who need to correlate information and data from many different sources, use their intelligence and intuition to triage the problem, and then manually use a network management tool to apply the fixes. Consider that in this use case we are trying to meet just one SLA (latency < 4 ms). Things become much more complicated if we are trying to meet the latency requirement without affecting the performance of other applications and services delivered over the same network.
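To make the roles concrete, here is a minimal sketch of how the pieces would fit together in an automated loop. Everything in it is hypothetical: `ToyNetwork`, the `enable_qos` action, and the latency numbers are invented for the example, and the `diagnose` function is a single hard-coded rule standing in for the root-cause analysis that humans perform today.

```python
from dataclasses import dataclass


@dataclass
class ToyNetwork:
    """Stand-in for the system under control; not a real device model."""
    base_latency_ms: float = 2.0
    congestion_ms: float = 5.0    # extra delay while the problem persists
    qos_enabled: bool = False

    def measure_latency(self):    # role of network monitoring
        extra = 0.0 if self.qos_enabled else self.congestion_ms
        return self.base_latency_ms + extra

    def apply(self, action):      # role of network management
        if action == "enable_qos":
            self.qos_enabled = True


def diagnose(latency_ms, target_ms):
    """The missing piece: map measurements to a corrective action."""
    return "enable_qos" if latency_ms >= target_ms else None


def control_loop(network, target_ms=4.0, max_cycles=5):
    """Measure, diagnose, act; stop once the target is met."""
    history = []
    for _ in range(max_cycles):
        latency = network.measure_latency()
        history.append(latency)
        action = diagnose(latency, target_ms)
        if action is None:
            break
        network.apply(action)
    return history


if __name__ == "__main__":
    print(control_loop(ToyNetwork()))
```

The only human input is `target_ms`. The hard part in practice is `diagnose`: with many data sources, many possible actions, and several SLAs sharing one network, a one-line rule no longer suffices.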
I have a background in control, and I can draw many parallels between control theory and the use case of a closed-loop control system for networks. Control theory has decades of research and development behind it; however, its principles cannot be transferred that easily to networks because networks are unique: large, highly distributed, multi-input, multi-output systems with no mathematical description.
Recent advances in data analytics and machine learning are helping us move towards a fully automated closed-loop network that won’t require any human intervention. This will eliminate cumbersome and dull tasks from network engineers’ day-to-day routines and free up more time for productive tasks such as designing networks.
At NetBeez, we are working towards a big vision. If you want to learn more about NetBeez, request a demo!