Simulating network outages is an important step in validating highly available or redundant network configurations. A second use case is testing the resiliency of mission-critical systems and applications. The simplest example of a test-driven network outage is to turn off the primary uplink of a dual-homed router and analyze the results. With appropriate tools, you can verify that network traffic quickly re-converges to the secondary link with minimal packet loss. More complex scenarios may need to introduce packet loss, increased round-trip time, bandwidth saturation, or DNS and DHCP failures. Several tools can implement these failure scenarios. I’ll briefly review some of them, particularly the open source solutions.
How do I simulate a hard-down or node failure?
Simulating a hard-down or node failure is fairly easy: turn off the node itself and observe how the system responds (system re-convergence). If high availability is enabled and working correctly, a secondary/standby node should take over and become the active node, servicing user requests. The time that the takeover needs is one of the results worth recording.
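One way to put a number on the re-convergence is to probe the service at a fixed interval while the node is turned off, then compute the longest run of failed probes. The sketch below assumes that the probe results have already been collected as (timestamp, success) pairs; the function name and the sample data are hypothetical and only illustrate the calculation.

```python
from typing import List, Tuple


def longest_outage(probes: List[Tuple[float, bool]]) -> float:
    """Return the longest run of failed probes, in seconds.

    Each probe is (timestamp_seconds, succeeded). An outage starts at the
    first failed probe and ends at the next successful one.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp  # first failed probe of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    # An outage still open at the end of the log is measured up to the last probe.
    if outage_start is not None and probes:
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


# Hypothetical probe log: one probe per second, two probes lost during failover.
probes = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (4.0, True), (5.0, True)]
print(longest_outage(probes))  # prints 2.0
```

The resolution of the result is limited by the probe interval, so a shorter interval gives a tighter estimate of the failover time.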
How do I simulate packet loss?
Packet loss is the percentage of packets that arrive malformed, or do not arrive at all, at a destination. It is an important performance metric that reflects the reliability of a network infrastructure. To simulate packet loss, you can use an open source tool called tc (traffic control). tc ships with the iproute2 package and can be installed on any Linux host. With this tool you can introduce packet loss on a network interface, as well as additional latency and round-trip time. Depending on the test plan, you can apply packet loss at the source, at the destination, or at an intermediate host between the two end-points. If you want to learn more, read the blog post that Panos wrote on how to use tc.
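As a minimal sketch of what such a tc invocation can look like, the helper below builds a command that attaches the netem queuing discipline to an interface with a given loss percentage and delay. The helper name and the interface name eth0 are assumptions made for illustration. The function only builds the argument list; running the command requires root privileges on the Linux host.

```python
from typing import List


def netem_command(interface: str, loss_percent: float = 0.0,
                  delay_ms: int = 0, action: str = "add") -> List[str]:
    """Build a tc command that adds, changes, or removes a netem qdisc."""
    if action not in ("add", "change", "del"):
        raise ValueError("action must be 'add', 'change', or 'del'")
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    cmd = ["tc", "qdisc", action, "dev", interface, "root"]
    if action == "del":
        return cmd  # removing the root qdisc restores normal behavior
    cmd.append("netem")
    if delay_ms > 0:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_percent > 0:
        cmd += ["loss", f"{loss_percent:g}%"]
    return cmd


# eth0 is an assumed interface name; replace it with the interface under test.
print(" ".join(netem_command("eth0", loss_percent=5, delay_ms=100)))
# prints: tc qdisc add dev eth0 root netem delay 100ms loss 5%
print(" ".join(netem_command("eth0", action="del")))
# prints: tc qdisc del dev eth0 root
```

The resulting list can be passed to subprocess.run once the values have been reviewed. Keeping the "del" command at hand makes it easy to remove the impairment when the test is over.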
How do I simulate a routing change?
Routing changes are variations in the path between a source and a destination. To simulate a routing change, you can turn off an intermediate node or interface that the traffic under test traverses. To do this without causing unneeded outages, you may want to set up a network lab that is separate from the production network. If your application or service is public, then you can just rely on the Internet itself. If you connect the source and destination hosts to two different service providers, you’ll most likely observe frequent routing changes that can be used as testing scenarios. You can use the traceroute command to verify that routing changes are actually happening.
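Comparing two traceroute runs by eye gets tedious when the test repeats many times. The sketch below assumes that the hop addresses of each run have already been extracted into lists, and reports the first hop at which two paths differ. The function name is hypothetical, and the addresses come from the ranges reserved for documentation.

```python
from typing import List, Optional


def first_divergence(before: List[str], after: List[str]) -> Optional[int]:
    """Return the 1-based hop number where two paths first differ.

    Returns None when both paths are identical.
    """
    for hop, (a, b) in enumerate(zip(before, after), start=1):
        if a != b:
            return hop
    # One path is a prefix of the other: they diverge right after the shorter one.
    if len(before) != len(after):
        return min(len(before), len(after)) + 1
    return None


# Hypothetical hop lists taken before and after an intermediate node was turned off.
before = ["192.0.2.1", "198.51.100.1", "203.0.113.10"]
after = ["192.0.2.1", "198.51.100.9", "203.0.113.10"]
print(first_divergence(before, after))  # prints 2
```

Hops that do not answer (shown as * by traceroute) can be stored as a placeholder string, although a missing answer on its own does not prove that the route changed.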
How do I simulate bandwidth saturation?
The best way to simulate bandwidth saturation is to flood the network with traffic. To do this, you’ll need a computer that acts as a packet generator and is equipped with a high-speed network interface card (NIC). Most of today’s computers are equipped with a 1 Gig Ethernet NIC, and you can also purchase a 10 Gig or a 100 Gig NIC. Make sure that the packet generator can actually transmit at the throughput required to saturate your network. From a tooling perspective, you can use iperf, an open source tool that generates TCP or UDP traffic and supports both Windows and Linux. Just make sure that the NIC is supported by your operating system. Also, please be advised that iperf support for 100 Gig traffic remains experimental. You can check this resource on the ESnet website. If you want to get “up to speed” with this tool 🙂 watch the recorded webinar that Panos did on how to use iperf.
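To make the packet generator setup concrete, here is a small sketch that builds a client command line. It assumes the iperf3 binary, and an iperf3 server already started with "iperf3 -s" on the receiving host. The helper name, the server address, and the default values are assumptions for illustration, not recommended settings.

```python
from typing import List


def iperf3_client_command(server: str, bandwidth: str = "1G", seconds: int = 30,
                          udp: bool = True, streams: int = 1) -> List[str]:
    """Build an iperf3 client command that sends traffic toward `server`."""
    if seconds <= 0 or streams <= 0:
        raise ValueError("seconds and streams must be positive")
    cmd = ["iperf3", "-c", server, "-t", str(seconds), "-P", str(streams)]
    if udp:
        cmd.append("-u")  # UDP keeps sending at the target rate even under loss
    cmd += ["-b", bandwidth]  # target bitrate per stream, for example 1G
    return cmd


# 192.0.2.10 is a placeholder address for the host running "iperf3 -s".
print(" ".join(iperf3_client_command("192.0.2.10", bandwidth="1G", streams=4)))
# prints: iperf3 -c 192.0.2.10 -t 30 -P 4 -u -b 1G
```

UDP is used here because the sender does not back off when the link congests, which suits a saturation test. With TCP, congestion control adapts the sending rate to the available bandwidth.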
How do I simulate a DNS error?
To simulate DNS errors, you can query a nonexistent DNS server, turn off the DNS service on the server (in a lab environment), or query a nonexistent DNS record. While these approaches lead to the same outcome (the host can’t resolve a specific DNS name), the DNS error code is different, and an application may behave differently in each case.
If the server is up, but the DNS entry is unknown, the test will fail with a “couldn’t resolve host” error as soon as the reply is received from the server.
However, if the server is down, the DNS query will wait for the DNS lookup timer to expire before declaring a timeout error. It’s important to consider whether both scenarios should be included in your testing plan.
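The difference between the two failure modes shows up mainly in how long the lookup takes to fail. The sketch below times a lookup and labels the outcome accordingly. The function name, the outcome labels, and the one-second threshold are assumptions; the threshold should be chosen to fit the resolver timeout configured on the test host.

```python
import socket
import time
from typing import Callable, Tuple


def classify_lookup(name: str,
                    resolve: Callable[[str], object] = socket.gethostbyname,
                    slow_after: float = 1.0) -> Tuple[str, float]:
    """Resolve `name` and return (outcome, elapsed_seconds).

    A failure that returns quickly suggests the server answered with an
    error; a failure that takes longer than `slow_after` suggests a timeout.
    """
    start = time.monotonic()
    try:
        resolve(name)
    except OSError:  # socket.gaierror and socket.timeout are subclasses of OSError
        elapsed = time.monotonic() - start
        outcome = "slow failure" if elapsed >= slow_after else "fast failure"
        return outcome, elapsed
    return "resolved", time.monotonic() - start


def unknown_record(name: str) -> str:
    """Stand-in resolver that fails at once, like a server answering 'no such name'."""
    raise socket.gaierror("name not known")


outcome, elapsed = classify_lookup("missing.example", resolve=unknown_record)
print(outcome)  # prints: fast failure
```

In a real test, the default resolver can be used by calling classify_lookup("no-such-host.invalid"), since the .invalid top-level domain is reserved and never resolves. The stand-in resolver above only keeps the example independent of the network.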
How do I simulate a DHCP error?
The DHCP service is responsible for assigning dynamic IP addresses to network hosts. If this service is not working correctly, users won’t be able to connect to the network. In the WiFi timing article I previously wrote, I describe how the DHCP negotiation between a client and the server (the D-O-R-A transaction: Discover, Offer, Request, Acknowledge) is structured. iptables is a host firewall for Linux machines that allows you to manipulate the network traffic processed by the local host. With iptables you can filter specific D-O-R-A packets exchanged between the DHCP server and its clients, which allows you to simulate different DHCP failure scenarios.
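As a rough sketch, the helper below builds iptables rules that drop DHCP traffic in one direction, based on the standard UDP ports: clients send Discover and Request messages to port 67, and servers send Offer and Acknowledge messages to port 68. The helper name and the direction labels are assumptions, and the rules do not account for relay agents. Some DHCP implementations read traffic through raw packet sockets, which may not be subject to iptables filtering, so it is worth confirming with a packet capture that a rule has the intended effect.

```python
from typing import List

# Destination UDP port for each direction of the D-O-R-A exchange.
DHCP_DEST_PORT = {
    "client-to-server": "67",  # Discover and Request
    "server-to-client": "68",  # Offer and Acknowledge
}


def dhcp_drop_rule(direction: str, chain: str = "INPUT") -> List[str]:
    """Build an iptables command that drops DHCP packets in one direction."""
    if direction not in DHCP_DEST_PORT:
        raise ValueError("direction must be 'client-to-server' or 'server-to-client'")
    return ["iptables", "-A", chain, "-p", "udp",
            "--dport", DHCP_DEST_PORT[direction], "-j", "DROP"]


# On the DHCP server: ignore incoming Discover and Request messages.
print(" ".join(dhcp_drop_rule("client-to-server")))
# prints: iptables -A INPUT -p udp --dport 67 -j DROP

# On the DHCP server: suppress outgoing Offer and Acknowledge messages.
print(" ".join(dhcp_drop_rule("server-to-client", chain="OUTPUT")))
# prints: iptables -A OUTPUT -p udp --dport 68 -j DROP
```

Replacing -A with -D in the same command removes the rule once the test is over.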
Conclusion
Simulating network outages is needed to ensure that high availability is correctly implemented within a network or application. Many open source tools, such as tc, iperf, and iptables, are available to implement the most common network outage scenarios.