What differentiates good Internet Service Providers (ISPs) from bad ones? Whether or not they meet their service level agreements (SLAs). An SLA is a contract between a service provider and a subscriber that defines the level of performance expected from the service provider. An ISP that doesn’t respect its SLAs will deal with angry customers, lose subscribers to competitors, or, worse, face lawsuits.
In a survey performed by the US Federal Communications Commission, it was found that one out of three Internet users switched broadband providers because they were looking for a better price or better performance. This number was high enough to draw the attention of ISP executives, who expect the network engineering team to implement SLA verification. As a result, network engineers are always looking for accurate and efficient ways to verify and assure that subscribers are getting the performance they were sold.
If you work for an ISP, you already know that the best way to verify that an SLA is met is to periodically run network performance tests that measure parameters such as network latency and speed. This may sound like an easy task, but it generally isn’t: for SLA tests to be accurate, they need to run in production, under conditions similar to those the customers experience. This poses many challenges.
ISP networks are large and complex. They are large because they serve a large number of customers. They are complex because they are extremely diverse, with different types of underlying connections and third-party carriers that enable the ISP to extend its reach to locations where it doesn’t have network infrastructure. On top of that, these networks are geographically distributed, oftentimes covering multiple US states.
The best way to cover all the bases is to deploy network sensors in sample locations. Each sensor should be deployed so that its measurements reflect the performance experienced by all users in that area. To achieve this, make sure that the SLA tests the monitoring sensors run traverse the same network infrastructure components that users in that area rely on. This way, the ISP can detect outages or performance issues caused by the network infrastructure itself.
Once the deployment dilemma has been solved, the network engineer has to set up monitoring tests on the sensors. Here is a list of the most common SLA tests that are generally configured.
Network round-trip time (RTT). This is the time that it takes for a packet to go back and forth between two hosts. This test can be executed with a simple ICMP echo request/reply test via the command line utility ping. Some networks deprioritize ICMP traffic with a quality of service (QoS) configuration. In this case, the test can be run with TCP traffic using a command line utility like hping3. If TCP is selected as the measurement method, the round-trip time should not be calculated from the SYN/SYN-ACK transaction time, as some firewalls may intercept and inspect the packets, introducing further delays in the measurements that are not caused by the network itself. For real-time applications to function properly, network round-trip time should be less than 150 ms. Also, when monitoring RTT, its lowest value should be used as the benchmark.
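As a rough illustration of the last two points, the sketch below extracts per-packet RTTs from ping-style output, takes the lowest value as the baseline, and compares it against the 150 ms limit. The function names and the sample output are made up for this example (192.0.2.1 is a documentation address), and the parser assumes lines of the form `time=<value> ms` as printed by the Linux ping utility.

```python
import re
from statistics import mean

REALTIME_RTT_LIMIT_MS = 150.0  # real-time threshold quoted in the text


def parse_rtts(ping_output):
    """Extract per-packet RTTs (in ms) from lines containing 'time=<value> ms'."""
    return [float(v) for v in re.findall(r"time=([\d.]+)\s*ms", ping_output)]


def summarize_rtt(samples_ms):
    """Summarize RTT samples, using the lowest value as the baseline."""
    if not samples_ms:
        raise ValueError("at least one RTT sample is required")
    baseline = min(samples_ms)
    return {
        "baseline_ms": baseline,
        "avg_ms": mean(samples_ms),
        "max_ms": max(samples_ms),
        "realtime_ok": baseline < REALTIME_RTT_LIMIT_MS,
    }


# Illustrative sample only, not a real capture.
SAMPLE = """\
64 bytes from 192.0.2.1: icmp_seq=1 ttl=57 time=18.4 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=57 time=21.0 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=57 time=17.9 ms
"""

if __name__ == "__main__":
    print(summarize_rtt(parse_rtts(SAMPLE)))
```

Tracking the minimum alongside the average makes it easier to separate the path's baseline delay from transient queuing.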
Packet loss. Packet loss is the percentage of packets that are lost in a given period of time. As with RTT, packet loss can be measured with ping. An alternative way to obtain an accurate estimate of packet loss is to generate a UDP stream using the Iperf utility. Packet loss should never be higher than 5%; otherwise, applications will start having performance issues.
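The calculation itself is simple. Here is a minimal sketch, assuming the sent and received packet counts have already been collected from a tool such as ping or Iperf. The function names are hypothetical; the 5% limit is the one quoted above.

```python
MAX_LOSS_PERCENT = 5.0  # upper limit quoted in the text


def packet_loss_percent(sent, received):
    """Return the percentage of packets that were sent but never received."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


def loss_within_limit(sent, received, limit=MAX_LOSS_PERCENT):
    """True when the measured loss does not exceed the limit."""
    return packet_loss_percent(sent, received) <= limit


if __name__ == "__main__":
    # Example counts chosen for illustration: 6 of 200 packets lost.
    print(packet_loss_percent(200, 194), loss_within_limit(200, 194))
```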
Path MTU. This measurement reports the size of the largest packet that can traverse a network path without being discarded. Generally, 1500-byte packets should be allowed to traverse the network. If that’s not possible, some applications may have performance issues or stop working completely. This value can be calculated with the traceroute utility in Linux. If you want to learn more about troubleshooting MTU issues, I previously covered this topic here.
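To make the idea concrete, here is a sketch of a binary search for the largest packet size that gets through. It is not how traceroute works internally. The `probe` callback is hypothetical: it stands in for whatever mechanism sends a packet of a given total size with the Don't Fragment bit set and reports whether it arrived. The header sizes assume IPv4 without options and ICMP echo, and the search bounds are assumed defaults.

```python
IPV4_HEADER_BYTES = 20  # assumes an IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """ICMP echo payload size that produces an IP packet of exactly `mtu` bytes."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def find_path_mtu(probe, low=576, high=1500):
    """Binary-search the largest packet size in [low, high] accepted by probe().

    probe(size) is a caller-supplied function (hypothetical here) that returns
    True when a non-fragmentable packet of `size` bytes traverses the path.
    """
    if not probe(low):
        raise ValueError("even the smallest probe size did not get through")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    # Simulated path that drops anything larger than 1400 bytes.
    print(find_path_mtu(lambda size: size <= 1400))
    print(icmp_payload_for_mtu(1500))
```

For a standard 1500-byte MTU, the matching ICMP payload under these assumptions is 1472 bytes.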
Jitter. This value measures the variation in the delay of received packets. High jitter will degrade real-time applications, like voice-over-IP (VoIP) or video streaming. VoIP devices implement jitter buffering algorithms to compensate for packets that arrive with high timing variation, and packets can even get dropped when they arrive with excessive delay. A tool like Iperf can provide these measurements. Panos wrote a good article on the impact of jitter on VoIP calls in this blog post.
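There is more than one way to turn "variation in delay" into a number. One common definition is the smoothed interarrival jitter estimator described in RFC 3550, which the sketch below applies to a list of one-way transit times. The function name and the sample values are illustrative, and the input is assumed to be already-measured transit times in milliseconds.

```python
def rfc3550_jitter(transit_ms):
    """Smoothed interarrival jitter over a sequence of packet transit times.

    For each consecutive pair of packets, D is the difference in transit time,
    and the running estimate is updated as J = J + (|D| - J) / 16.
    """
    jitter = 0.0
    for previous, current in zip(transit_ms, transit_ms[1:]):
        delta = abs(current - previous)
        jitter += (delta - jitter) / 16.0
    return jitter


if __name__ == "__main__":
    # Illustrative transit times in ms, not measured data.
    print(rfc3550_jitter([40.0, 42.0, 39.0, 45.0, 41.0]))
```

The divisor of 16 smooths the estimate so that a single late packet does not dominate the reported value.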
Download and upload speed. This is the transfer rate, generally expressed in bits per second, at which a certain amount of data is exchanged between two hosts. This value depends on the type of line (ISDN, ADSL, Fiber, etc.) and the plan a subscriber has selected. Download and upload speed can be measured along the ISP network infrastructure with Iperf, or with SpeedTest all the way up to an Internet server. When measuring download and upload speed, packet loss should be less than 5% and jitter less than 150 ms; otherwise, you won’t have an accurate measurement. Also, the two testing devices involved in the measurements, on both the sending and receiving sides, must be capable of processing data rates equal to or higher than the actual speed of the network.
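As a small worked example, the sketch below converts a transfer (bytes moved over a number of seconds) into megabits per second and applies the loss and jitter limits quoted above to decide whether the measurement can be trusted. The function names are hypothetical, and 1 Mbps is taken as 1,000,000 bits per second.

```python
def throughput_mbps(bytes_transferred, seconds):
    """Transfer rate in megabits per second (1 Mbps = 1,000,000 bits/s)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred * 8 / seconds / 1_000_000


def measurement_is_valid(loss_percent, jitter_ms, max_loss=5.0, max_jitter=150.0):
    """True when loss and jitter are below the limits quoted in the text."""
    return loss_percent < max_loss and jitter_ms < max_jitter


if __name__ == "__main__":
    # Illustrative numbers: 125 MB transferred in 10 seconds.
    print(throughput_mbps(125_000_000, 10.0))
    print(measurement_is_valid(loss_percent=0.5, jitter_ms=4.0))
```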
Measuring SLAs for ISPs is easier said than done. When planning an SLA monitoring system and methodology, there are many variables that must be accounted for. I hope that this short post gave you some high-level directions on how to proceed and some insights to keep in mind. If you want to explore the topic more deeply, here are three Requests for Comments (RFCs) that you can review:
- RFC 6349 (https://tools.ietf.org/html/rfc6349), which offers a framework for throughput testing
- RFC 2544 (https://tools.ietf.org/html/rfc2544), which discusses and defines a number of tests that may be used to describe the performance characteristics of an interconnecting network device
- RFC 4821 (https://tools.ietf.org/html/rfc4821), which describes a method for Path MTU Discovery