Troubleshooting network performance issues is both an art and a science. I say that because computer networks are distributed and intrinsically complex, so many factors can contribute to a performance problem. To be good at troubleshooting, you therefore need to master many skills. For instance, you must know:
- how computer networks, including the TCP/IP stack, and network protocols work,
- what systems should be inspected during the troubleshooting effort,
- what tools can be used to collect diagnostic and performance data,
- how the network is designed, operated, and where you can find its documentation.
Furthermore, employing an unconventional, intuitive approach can be invaluable in minimizing the time spent on troubleshooting. Shorter troubleshooting periods translate to fewer disruptions for businesses.
Indeed, subpar network performance not only affects end-user experience and employee efficiency but also increases support costs and downtime expenses. This holds particularly true in sectors with critical operations, such as healthcare, financial services, and transportation, where costs can quickly escalate.
Common Network Performance Issues
Let’s do a quick review of the most common performance issues that impact connectivity and the end-user experience. Network engineers encounter various problems daily, but some of the most common network performance issues include:
- Network slowness (throughput)
- Application failures
- DNS failures
- Wireless issues
- Hardware failures
- Network security incidents
- Network changes
Information Gathering Phase
Before you start troubleshooting network performance issues, it’s important to collect all the necessary information about the problem itself. Journalists use a good framework for this, called the Five Ws. Here are the five associated questions you should ask yourself:
- Who is impacted? Describe which users are affected.
- What happened? Describe the problem: is it a complete loss of network connectivity/service (black-out) or a performance degradation (brown-out)?
- When did it happen? Describe when the problem occurred and whether it is still ongoing.
- Where did it happen? Describe where in the network the issue occurred.
- Why did it happen? This is the only question that will be answered once the troubleshooting effort is successfully concluded.
Once you have acquired most of the initial information, you can start troubleshooting.
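The Five Ws above can be captured as a simple intake record before the investigation begins. The sketch below is purely illustrative; the field names and sample values are my own assumptions, not part of any standard ticketing schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentIntake:
    """Structured record of the Five Ws gathered before troubleshooting.

    Field names here are illustrative, not part of any standard.
    """
    who: str = ""              # users or groups impacted
    what: str = ""             # black-out (total loss) vs. brown-out (degradation)
    when: str = ""             # time of onset; note whether still ongoing
    where: str = ""            # network segment, site, or path affected
    why: Optional[str] = None  # root cause; filled in only when troubleshooting concludes

# Hypothetical example of an intake record at the start of an investigation.
ticket = IncidentIntake(
    who="Finance team, ~40 users",
    what="Brown-out: file transfers slow, web apps load in >10 s",
    when="Started 09:15 local time, still ongoing",
    where="Branch office VLAN 120",
)
print(ticket.why is None)  # root cause is unknown at intake time
```

Leaving `why` empty until the end mirrors the point above: the fifth W is the output of the troubleshooting effort, not an input to it.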
When troubleshooting, it’s important to use a methodology that you understand and can easily follow. Three methods are well known:
- Top-down approach: This approach follows the OSI stack, and troubleshooting starts from the top application layer and goes all the way down to the physical one, until the root cause is identified.
- Bottom-up approach: This is the inverse of the top-down approach: troubleshooting starts from the bottom layer, the physical one, and works its way up.
- Follow the path approach: This approach focuses on the actual path that the broken, or degraded, connection traverses. Starting from the end user’s point of view, it follows, hop by hop, the route of the impacted traffic. In practice, we start at Layer 3 and stop at the network segment, or router, where the problem has been identified. Once there, we can apply one of the two previous approaches (top-down or bottom-up).
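The top-down and bottom-up walks can be sketched as a simple layer ordering. This is an illustrative toy, not a diagnostic tool: the `check` callable stands in for whatever per-layer diagnostics you actually run.

```python
# OSI layers from bottom (L1) to top, with Session and Presentation
# folded into Application for simplicity.
OSI_LAYERS = ["Physical", "Data Link", "Network", "Transport", "Application"]

def inspection_order(approach: str) -> list:
    """Return the order in which layers are inspected for a given approach."""
    if approach == "top-down":
        return list(reversed(OSI_LAYERS))
    if approach == "bottom-up":
        return list(OSI_LAYERS)
    raise ValueError(f"unknown approach: {approach}")

def find_faulty_layer(approach, check):
    """Walk layers in order; return the first layer whose check fails."""
    for layer in inspection_order(approach):
        if not check(layer):
            return layer
    return None

# Hypothetical example: a diagnostic reports that the Data Link layer is broken.
broken = {"Data Link"}
print(find_faulty_layer("bottom-up", lambda layer: layer not in broken))  # Data Link
```

Either walk terminates at the same faulty layer; the choice of direction mostly affects how many healthy layers you inspect first.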
During your investigation, you will need to gather different performance metrics that will help identify the root cause. To do this, we will group performance metrics based on where they sit within the OSI stack. The following table is not exhaustive, but lists the most important metrics. For the sake of simplicity, I merged the Session and Presentation layers into the Application layer.
| Layer | Performance metrics (non-exhaustive) |
|-------|--------------------------------------|
| Physical | Interface statistics such as transmitted and received bits, errors, etc. L1 metrics vary based on the physical medium observed (copper, fiber, wireless, etc.). |
| Data Link | Data link statistics such as transmitted and received frames, retransmissions, errors, etc. Here too, L2 metrics vary based on the link type observed (802.3, 802.11, …). In this layer I also include layer 2.5 statistics, such as VLAN and MPLS counters. |
| Network | Latency, packet loss, jitter, … |
| Transport | Throughput, TCP retransmissions, UDP packet loss, … |
| Application | Application logs, errors, payload traces, … |
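As a concrete example of the physical/data-link rows, on Linux the kernel exposes per-interface counters in `/proc/net/dev`. The sketch below parses a sample of that file; the sample text and counter values are illustrative, and on a real host you would read the file directly.

```python
# Sample content mimicking /proc/net/dev: two header lines, then one line
# per interface with 8 receive counters followed by 8 transmit counters.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1234567    9876    0    0    0     0          0         0  1234567    9876    0    0    0     0       0          0
  eth0: 987654321 765432   12    3    0     0          0       100 123456789 654321    0    0    0     0       0          0
"""

def parse_net_dev(text):
    """Return {interface: {rx_bytes, rx_errs, tx_bytes, tx_errs}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        name, counters = line.split(":", 1)
        fields = counters.split()
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),     # received bytes
            "rx_errs": int(fields[2]),      # receive errors
            "tx_bytes": int(fields[8]),     # transmitted bytes
            "tx_errs": int(fields[10]),     # transmit errors
        }
    return stats

stats = parse_net_dev(SAMPLE)
print(stats["eth0"]["rx_errs"])  # 12
```

A non-zero and growing error counter on an interface is often the first hint that the problem sits at L1/L2 rather than higher in the stack.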
Diagnostic Tools and Monitoring
Having access to the right tools and systems is just as important as having a good knowledge of network protocols. The following list includes some of the most commonly used tools and systems that help with troubleshooting:
- Physical medium testers such as fiber optic testers and WiFi spectrum analyzers.
- Network monitoring tools, such as SNMP pollers, flow analyzers (e.g. NetFlow or sFlow collectors), or synthetic monitoring tools such as NetBeez.
- Network management tools, including network devices’ console.
- Packet capture tools (e.g. Wireshark or tcpdump).
- Log inspection tools and aggregators.
- A list of the most recent network and application configuration changes (in case of human-generated outages).
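In the spirit of the synthetic monitoring tools above, here is a minimal sketch of one of the simplest active checks: timing a TCP connection. For a self-contained demo it connects to a throwaway local listener; in practice you would point it at a real service (the target here is an assumption, and this is not the API of any particular monitoring product).

```python
import socket
import time

def tcp_connect_latency(host: str, port: int, timeout: float = 2.0) -> float:
    """Return the time in seconds taken to establish a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start

# Demo against a throwaway local listener (port 0 lets the OS pick a free port).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

latency = tcp_connect_latency("127.0.0.1", port)
print(f"connect latency: {latency * 1000:.2f} ms")
server.close()
```

Run periodically against a few key services, even a check this simple gives you a baseline, so that when users report slowness you can tell a genuine brown-out from normal variation.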
In the realm of network performance troubleshooting, knowledge, experience, and the right tools are indispensable prerequisites. But the key to swift and accurate issue resolution is combining them with a well-structured approach: thorough information gathering up front, followed by a methodical troubleshooting process.