Network performance analysis

One of the most important functions of a network monitoring tool is reporting. A report lets you analyze the performance and status of any monitored resource, service, or device whose data can be stored in a database as a time series. You can compare it with other resources to discover trends, spot patterns such as daily, weekly, or monthly fluctuations, and identify underperforming assets. With reports, you can review graphs, tables, and charts related to network availability and application performance, and get a clear view of the status and performance of your network and applications.

In NetBeez, there are two types of visualizations that I really like: the global average and the continuous average. While these two terms are specific to NetBeez and could go by different names in other tools, their content and use are universal.

Global average

In the global average chart, performance averages across all monitoring sensors are compared with each other and represented as a histogram. Each bar in the histogram represents the average of a test run by one individual monitoring sensor over a specific period of time. If the selected time window is large enough, like one week or one month, the computed average is a good representation of the performance baseline of the location where the sensor is installed. Sounds like a great way to capture the end-user experience! Not only that, but this visualization also allows the network administrator to identify the best and the worst sensors and act on this data. A follow-up action could be to tune and optimize the network configuration at the locations that underperform, perhaps by enabling quality of service, or sometimes something more radical, like replacing the network hardware.
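To make the idea concrete, here is a minimal Python sketch of how such a chart could be computed from raw ping results. It is not how NetBeez implements it; the function name `global_average` and the input format (a list of sensor name and round-trip time pairs) are my own assumptions for illustration. The overall figure is computed as the average of the per-sensor averages, so that every location counts equally regardless of how many samples it collected, which is also an assumption.

```python
from statistics import mean


def global_average(samples):
    """Summarize ping results per sensor.

    samples: iterable of (sensor_name, rtt_ms) tuples (hypothetical format).
    Returns (ranked, overall) where ranked is a list of
    (sensor_name, average_rtt_ms) sorted from worst (highest) to best,
    and overall is the average of the per-sensor averages.
    """
    per_sensor = {}
    for sensor, rtt_ms in samples:
        per_sensor.setdefault(sensor, []).append(rtt_ms)

    # One bar of the chart per sensor: its average over the whole window.
    averages = {sensor: mean(values) for sensor, values in per_sensor.items()}

    # Worst sensors first, so underperforming locations stand out.
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    overall = mean(list(averages.values()))
    return ranked, overall


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    demo = [("sensor-a", 10.0), ("sensor-a", 20.0),
            ("sensor-b", 40.0), ("sensor-b", 60.0)]
    ranked, overall = global_average(demo)
    for sensor, avg in ranked:
        print(f"{sensor}: {avg:.1f} ms")
    print(f"overall: {overall:.1f} ms")
```

Sorting from worst to best puts the sensors that most need attention at the top of the list, which is what the administrator usually cares about.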

Let’s take, for example, the global average report of a target pointing to Google. In this target, I have configured 23 network sensors to run ping, DNS, HTTP, and traceroute tests on the www.google.com URL.

In the global average report and associated table (truncated) below, I compare the one-month average round-trip time, packet loss, and number of alerts generated by the ping tests run by the sensors set to monitor the target.

As you can see, I have lots of useful information grouped in an easy-to-read report. Here is some of the information that I got:

  • The average network latency to Google from all my network locations is 37.5 ms
  • Athens is the location with the highest latency, about 63 ms
  • Overall network latency from all locations is good because it’s less than 150 ms
  • The Pittsburgh WiFi sensor is the one with the highest packet loss and number of alerts generated; as a follow-up action, the wireless engineer could, for example, perform a site survey to verify WiFi coverage

Continuous average

The continuous average is another useful graph that network administrators should check at least once a week. In a continuous average chart, I can review trends and changes in test performance data. This visualization is a great instrument for viewing daily or weekly performance trends, such as increasing network latency during business hours, or the correlation between network latency and application response time.

Once again, let’s take the aforementioned Google target. This time, I have selected in NetBeez the continuous average for a one-day period. The plot below shows multiple lines, where each line represents the hourly average of a ping test to Google from one sensor.
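If you want to reproduce this kind of plot from your own data, the hourly averaging can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `continuous_average` and the input format (sensor name, timestamp, round-trip time) are hypothetical and not taken from NetBeez.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def continuous_average(samples):
    """Group ping results into hourly averages, one series per sensor.

    samples: iterable of (sensor_name, timestamp, rtt_ms) tuples, where
    timestamp is a datetime (hypothetical format).
    Returns {sensor_name: [(hour_start, average_rtt_ms), ...]} with each
    series sorted by time, i.e. one line of the chart per sensor.
    """
    buckets = defaultdict(list)
    for sensor, timestamp, rtt_ms in samples:
        # Truncate the timestamp to the start of its hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[(sensor, hour_start)].append(rtt_ms)

    series = defaultdict(list)
    for (sensor, hour_start), values in sorted(buckets.items()):
        series[sensor].append((hour_start, mean(values)))
    return dict(series)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    demo = [
        ("sensor-a", datetime(2024, 1, 1, 9, 5), 10.0),
        ("sensor-a", datetime(2024, 1, 1, 9, 35), 30.0),
        ("sensor-a", datetime(2024, 1, 1, 10, 10), 50.0),
    ]
    for sensor, points in continuous_average(demo).items():
        for hour_start, avg in points:
            print(sensor, hour_start.strftime("%H:%M"), f"{avg:.1f} ms")
```

Changing the truncation step (for example to the start of the day) would give the daily buckets you would want for a weekly or monthly view.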

As you can see, the network latency of some sensors increases during business hours. You probably can’t read the names, but I can tell you that the sensors with high latency are all at the same location. What we see here is a problem well known to network engineers: oversubscription of bandwidth. When a router receives more traffic than its outgoing link can carry, it starts buffering packets in its queues so that every packet gets a chance to be transmitted. The time packets spend waiting in those queues causes the behaviour that you see plotted here: increased latency. If the rate of incoming packets stays consistently higher than the rate at which the outgoing interface can transmit, the queues fill up and the router starts dropping packets, causing retransmissions for TCP connections.
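The queueing behaviour described above can be illustrated with a toy model. The sketch below is a deliberately simplified, tick-based simulation of a single FIFO queue with a finite buffer; the function `simulate_queue` and all of its parameters are hypothetical, and a real router is far more complex (multiple queues, scheduling, active queue management, and so on).

```python
def simulate_queue(arrivals, capacity, buffer_size):
    """Toy model of an oversubscribed link.

    arrivals:    packets arriving in each tick
    capacity:    packets the outgoing link can send per tick
    buffer_size: maximum number of packets that can wait in the queue
    Returns a list of (backlog, delay_ticks, dropped) per tick, where
    delay_ticks estimates how long a newly queued packet would wait.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog += arriving
        # The link drains up to `capacity` packets per tick.
        backlog -= min(backlog, capacity)
        # Whatever does not fit in the buffer is dropped.
        dropped = max(0, backlog - buffer_size)
        backlog -= dropped
        history.append((backlog, backlog / capacity, dropped))
    return history


if __name__ == "__main__":
    # Made-up load: quiet, three busy ticks above capacity, quiet again.
    for backlog, delay, dropped in simulate_queue(
        [5, 15, 15, 15, 5], capacity=10, buffer_size=8
    ):
        print(f"backlog={backlog} delay={delay:.1f} ticks dropped={dropped}")
```

In this model, latency rises as soon as arrivals exceed capacity, and drops only appear once the buffer is full, which matches the order of symptoms described above: first higher latency, then packet loss.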

How can we determine whether increased network latency is good or bad for our applications and end-user experience? By analyzing the continuous average of the HTTP tests to Google, we can see whether the increased response time is within the range of acceptable values.

In this example, we see that network latency is clearly impacting the response time of the Google.com search. For most of the day, the response time is less than half a second, except for the peak around 5 PM, where it reaches one second, a value that should be the upper limit for an HTTP GET.
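If you export the hourly averages, you can back up this visual impression with numbers. The Python sketch below computes the Pearson correlation between two hourly series (for example ping latency and HTTP response time) and lists the hours where the response time reaches a limit. Both helper names, `pearson` and `hours_over_limit`, are hypothetical, and the one-second default simply mirrors the upper limit mentioned above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def hours_over_limit(hourly_response_s, limit_s=1.0):
    """Return the hours whose average response time reaches the limit.

    hourly_response_s: iterable of (hour, response_time_seconds) tuples.
    """
    return [hour for hour, value in hourly_response_s if value >= limit_s]


if __name__ == "__main__":
    # Made-up hourly averages, for illustration only.
    latency_ms = [20.0, 22.0, 35.0, 60.0, 25.0]
    response_s = [0.40, 0.42, 0.55, 1.00, 0.45]
    print(f"correlation: {pearson(latency_ms, response_s):.2f}")
    print("hours at or over limit:",
          hours_over_limit(zip([14, 15, 16, 17, 18], response_s)))
```

A coefficient close to 1 suggests that response time moves together with network latency, although correlation alone does not prove that the network is the cause.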

In the end, it’s up to the network engineer to determine whether such a value is acceptable for their users, and whether the Internet connection should be upgraded or QoS should be implemented in the network.

Conclusion

Network performance reports are essential for delivering an excellent end-user experience. They should be reviewed and analyzed periodically to improve network performance and availability, and to catch suboptimal configurations that could cause future service disruptions.