Network troubleshooting is the process of acquiring information and collecting evidence to identify the root cause of a network outage or performance issue. The goal of network troubleshooting is to resolve the problem as soon as possible by performing corrective actions on the root cause. In this article, we will cover network troubleshooting techniques and possible tools that aid the process.
Why Network Troubleshooting is Important
Network troubleshooting is a key aspect of network management that requires investment in tools and resources. Efficient network troubleshooting offers several advantages to organizations, such as:
- Reduced downtime – Efficient troubleshooting will reduce the amount of downtime that digital services and information systems experience.
- Cost savings – Network outages cost money due to interrupted services, SLA penalties, and customer complaints that increase cancellations. The shorter the outage, the smaller its monetary impact.
- Improved performance – Identifying bottlenecks, performance issues, and other network degradation problems translates into a better performing infrastructure and digital experience.
To achieve these benefits, organizations need to consider three main elements:
- the network monitoring tools they use,
- the support personnel who handle network incidents and trouble tickets, and
- the processes and escalation procedures adopted.
Network Monitoring Tools
In the context of network troubleshooting, network monitoring tools provide alerting, diagnostic data, network performance metrics, logs, and statistics. During the troubleshooting process, support personnel may need to review data from different sources, such as:
- SNMP pollers – Provide the status and diagnostic data on network devices; SNMP helps identify events such as hardware or link failures, software errors or bugs, and anything else that could affect a network component.
- Passive analyzers – These tools help identify bottlenecks caused by one or more devices saturating a network’s bandwidth or specific portions of it; they can also inspect sequences of packets to pinpoint performance issues between a client and a server.
- Active monitoring tools – Active measurements include end-to-end reachability, round-trip time, packet loss, and other network metrics; tools like NetBeez alert on network performance degradation and collect metrics about the end-user experience of network services and applications.
Since each tool type provides information about a specific aspect of the network, companies should have each one of them in place.
Support Team
The significance of human resources in network troubleshooting cannot be overstated. Networks, the digital backbone of modern organizations, depend fundamentally on skilled individuals to function optimally. Without dedicated professionals who possess the knowledge and expertise to diagnose and resolve network issues, troubleshooting simply cannot occur effectively. Therefore, organizations must prioritize investments in their workforce, offering continuous training and fostering a workplace culture that values and retains employees. By doing so, they not only ensure the seamless operation of their networks but also empower their teams to proactively address challenges and drive innovation in the ever-evolving field of network management.
Support Tiers and Escalations
Support teams are often organized into tiers to efficiently manage and resolve customer issues. Many organizations adopt a three-tiered approach, organized as follows:
- Tier 1 handles initial customer inquiries, offering basic troubleshooting and solutions to common problems; the Tier 1 team is mostly composed of help desk or support agents.
- Tier 2 consists of specialists with deeper technical expertise who take care of more complex or unresolved issues; generally Tier 2 personnel are also operating within a Network Operations Center (NOC).
- Tier 3, the highest level, tackles the most intricate and critical issues, often involving complex systems or network configurations; in this tier we generally find Network Engineers or Architects whose primary responsibility is the design and implementation of network solutions.
This approach streamlines the support process, ensuring that each team focuses on issues that match their skill level. Ultimately this organization leads to quicker problem resolution, improved customer satisfaction, and lower support costs. When analyzing and optimizing support costs, consider three factors:
- number of tickets,
- average time spent on tickets, and
- number of ticket escalations.
Support costs rise with each of these three metrics. One easy way for organizations to reduce costs is to adopt network monitoring tools that enable lower tiers to troubleshoot issues that would otherwise have to be escalated to the higher, more expensive tiers.
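To make the relationship between tickets, handling time, escalations, and cost concrete, here is a minimal sketch of a cost model. The `TierLoad` structure, the hourly rates, and the ticket counts are all hypothetical values chosen for illustration; a real model would use an organization's own ticketing data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TierLoad:
    """Monthly workload of one support tier (all values are hypothetical)."""
    tickets: int        # tickets handled at this tier
    avg_hours: float    # average hours spent per ticket
    hourly_rate: float  # assumed fully loaded hourly cost of the tier


def monthly_support_cost(tiers: list[TierLoad]) -> float:
    """Sum tickets x average hours x hourly rate across all tiers."""
    return sum(t.tickets * t.avg_hours * t.hourly_rate for t in tiers)


# Before: 100 tickets reach Tier 1, 40 escalate to Tier 2, 10 to Tier 3.
before = [
    TierLoad(100, 0.5, 30.0),
    TierLoad(40, 1.0, 60.0),
    TierLoad(10, 2.0, 100.0),
]
# After: better tooling at Tier 1 halves the escalations (an assumption).
after = [
    TierLoad(100, 0.5, 30.0),
    TierLoad(20, 1.0, 60.0),
    TierLoad(5, 2.0, 100.0),
]

print(monthly_support_cost(before))  # 5900.0
print(monthly_support_cost(after))   # 3700.0
```

In this simplified model, Tier 1 handling time is assumed to stay the same after the tooling change; in practice it may increase slightly as Tier 1 resolves more tickets itself.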
Troubleshooting Common Network Issues
In an ideal world, the majority of network issues are detected by network monitoring tools and require minimal network troubleshooting. The reality is that network troubleshooting is a daily activity for most organizations. The following is a non-exhaustive list of common network issues that require some degree of troubleshooting:
- Hardware or link failures: These failures should be detected via SNMP and generate an alert; other ways to troubleshoot hardware or link failures are to access a network device’s administrative interface, review its system logs, or physically inspect the network.
- Network connectivity: Connectivity issues can happen in different portions of the network, including the Internet; active network monitoring tools and commands like traceroute can help troubleshoot connectivity issues and identify where the failure originated.
- Slow network speed: Sluggish application performance could indicate a bandwidth problem; in this case, active monitoring tools provide a baseline of the bandwidth available, while passive analysis tools can help pinpoint which hosts or applications are saturating the network.
- Configuration issues or network changes: These types of failures can be hard to troubleshoot due to network complexity; in general, it’s good practice to test a network and its services after a configuration change to rule out any unexpected human-caused outage.
- Wireless: WLANs can experience localized RF (radio-frequency) issues such as low signal or interference, infrastructure issues such as RADIUS failures, or client-specific errors; a WiFi network monitoring tool will provide the data needed to identify the most common root causes.
- User error or perception: When the network is not at fault, the support team is still required to provide proof of innocence; network monitoring tools, packet traces, and ping tests will provide the data required to exonerate the network.
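The baseline idea mentioned under slow network speed can be sketched as a simple statistical check: compare a new throughput measurement against the mean and standard deviation of past measurements. The function name, the sample values, and the threshold of three standard deviations are assumptions made for illustration, not a description of how any particular monitoring tool works.

```python
from statistics import mean, stdev


def below_baseline(history_mbps, measurement_mbps, k=3.0):
    """Return True when a measurement falls more than k standard
    deviations below the mean of the historical samples."""
    threshold = mean(history_mbps) - k * stdev(history_mbps)
    return measurement_mbps < threshold


# Hypothetical throughput history for a 1 Gbps link, in Mbps.
history = [940, 935, 950, 945, 930]

print(below_baseline(history, 600))  # True: well below the baseline
print(below_baseline(history, 935))  # False: within normal variation
```

A check like this only says that performance deviates from what is normal for that path; identifying which hosts or applications cause the deviation still requires passive analysis.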
How to Troubleshoot Network Problems: A Layered Methodology
When troubleshooting network problems, it’s very important to keep the OSI model in mind and work your way up from the lower physical layer to the application layer. This bottom-up approach helps to successfully troubleshoot network problems because each layer relies on the lower one to function properly. In the following sections, we’ll provide some basics on troubleshooting the first four layers of the OSI model plus a brief note about the application layer.
Troubleshooting the Physical Layer (OSI Layer 1)
The physical layer includes anything that generates and moves bits from point A to point B. This is the network interface layer: Ethernet or WiFi cards, copper and fiber cables, and the air, all of which enable hosts to communicate with other hosts and the outside world in general. To troubleshoot this layer, the network engineer can use the diagnostic tools that hardware vendors include with their products.
In the case of Ethernet cards, basic diagnostic commands report information on the duplex and link speed that the card has established with the other side of the cable. In the case of a WiFi adapter, the utility should report the signal strength and link quality of the connection established with the base station or ad-hoc peer. This data is important to understand the quality of the layer 1 link established.
To troubleshoot problems with copper or fiber cables, you can use a time-domain reflectometer (TDR), or an optical time-domain reflectometer (OTDR) in the case of a fiber link. Some networking vendors also include basic TDR functions on their equipment. In the case of WiFi networks, spectrum analyzers are very useful to provide information about “the air” and detect any interference in the surroundings, such as microwave ovens.
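As an example of reading Layer 1 link parameters, the sketch below parses the text printed by the Linux `ethtool <interface>` command and flags common symptoms. The sample text is a shortened, illustrative excerpt rather than a real capture, and the helper names and the expected speed of `1000Mb/s` are assumptions.

```python
import re

# Shortened, illustrative excerpt in the format printed by `ethtool eth0`.
SAMPLE_OUTPUT = """Settings for eth0:
\tSpeed: 100Mb/s
\tDuplex: Half
\tLink detected: yes
"""


def parse_ethtool(output):
    """Extract speed, duplex, and link state from ethtool text output."""
    patterns = {
        "speed": r"^\s*Speed:\s*(\S+)",
        "duplex": r"^\s*Duplex:\s*(\S+)",
        "link": r"^\s*Link detected:\s*(\S+)",
    }
    fields = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, output, re.MULTILINE)
        fields[key] = match.group(1) if match else None
    return fields


def link_warnings(fields, expected_speed="1000Mb/s"):
    """Return human-readable warnings for suspicious link parameters."""
    warnings = []
    if fields["link"] != "yes":
        warnings.append("no link detected")
    if fields["duplex"] == "Half":
        warnings.append("half duplex (possible duplex mismatch)")
    if fields["speed"] not in (None, expected_speed):
        warnings.append(
            f"speed {fields['speed']} differs from expected {expected_speed}"
        )
    return warnings


for warning in link_warnings(parse_ethtool(SAMPLE_OUTPUT)):
    print(warning)
```

On a live Linux host, the same parser could be fed the output of `subprocess.run(["ethtool", "eth0"], capture_output=True, text=True).stdout`, assuming ethtool is installed and the interface name is correct.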
Troubleshooting the Data-Link Layer (OSI Layer 2)
To troubleshoot data-link layer issues, network engineers can access the command line of a switch to inspect the MAC address table, which provides information about the MAC addresses learned on switch ports. To troubleshoot Layer 2 communications between hosts, network engineers can use passive analysis tools such as Wireshark, which is GUI-based, or tcpdump, which is command-line based. Such tools record the frames flowing across a network link, switch, or host.
Another important thing to keep in mind when troubleshooting Layer 2 issues is the spanning tree protocol. Spanning tree is a Layer 2 protocol that enables switched networks to build a loop-free topology even when redundant links, which would otherwise create loops, are introduced in a network design. When a network topology has a loop, frames can circulate indefinitely without reaching their destination host or being discarded, causing broadcast storms. Broadcast storms saturate network links and cause instability in the CAM (Content Addressable Memory) table of switches. The spanning tree protocol avoids this scenario by blocking the switch ports that would cause loops. However, for spanning tree to work properly, all switches in the network must be correctly configured. Familiarity with the spanning tree protocol and its diagnostic commands on switches is essential knowledge for network troubleshooting.
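One symptom of a Layer 2 loop is a MAC address that keeps moving between switch ports. The sketch below looks for that pattern in a series of MAC address table observations. The `(vlan, mac, port)` tuple format, the port names, and the function name are hypothetical; real data would have to be collected from the switch CLI or via SNMP and converted to this shape.

```python
from collections import defaultdict


def find_flapping_macs(observations):
    """Return the (vlan, mac) pairs that were learned on more than one port.

    observations: iterable of (vlan, mac, port) tuples gathered from
    successive snapshots of a switch MAC address table.
    """
    ports_seen = defaultdict(set)
    for vlan, mac, port in observations:
        ports_seen[(vlan, mac.lower())].add(port)
    return {
        key: sorted(ports)
        for key, ports in ports_seen.items()
        if len(ports) > 1
    }


# Hypothetical observations from three table snapshots.
observations = [
    (10, "00:11:22:33:44:55", "Gi0/1"),
    (10, "AA:BB:CC:DD:EE:FF", "Gi0/3"),
    (10, "00:11:22:33:44:55", "Gi0/2"),
]

flapping = find_flapping_macs(observations)
print(flapping)  # {(10, '00:11:22:33:44:55'): ['Gi0/1', 'Gi0/2']}
```

A MAC address seen on two ports is not always a loop: a laptop that was moved or a roaming wireless client produces the same pattern, so the result is a starting point for inspecting spanning tree state, not a verdict.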
Troubleshooting the Network Layer (OSI Layer 3)
The most used commands to troubleshoot Layer 3 issues are ping and traceroute. With ping you can verify whether a host can reach a destination network or host. With traceroute you can discover the routing hops between a source and a destination. When troubleshooting Layer 3 problems, it’s important to consider whether the destination host is located within your organization or not. If it is, the troubleshooting effort aims at figuring out whether a network misconfiguration, or something else, is causing the connectivity or performance issues. If, on the other hand, the network path to the destination host traverses a third party, then it’s important to collect enough information to show that the problem lies in the third party’s network. Either way, ping and traceroute are two useful commands that shed light on reachability and performance issues between two IP hosts.
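When ping results need to be collected as evidence, it helps to turn the command output into numbers. The sketch below wraps the system `ping` command and extracts packet loss and average round-trip time from its summary lines. It assumes the `-c` count option and the summary format used by Linux and macOS; the sample text is illustrative, and the function names are not part of any library.

```python
import re
import subprocess

# Illustrative summary in the format printed by Linux ping.
SAMPLE_OUTPUT = """--- 192.0.2.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 10.100/12.500/15.200/1.900 ms
"""


def parse_ping(output):
    """Extract packet loss (%) and average RTT (ms) from ping output."""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", output)
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "rtt_avg_ms": float(rtt.group(2)) if rtt else None,
    }


def ping(host, count=4):
    """Run the system ping (Linux/macOS syntax assumed) and parse it."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return parse_ping(result.stdout)


print(parse_ping(SAMPLE_OUTPUT))  # {'loss_pct': 0.0, 'rtt_avg_ms': 12.5}
```

Calling `ping("192.0.2.1")` against a real destination returns the same dictionary shape; when the host is unreachable, the RTT line is absent and `rtt_avg_ms` is `None`.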
Troubleshooting the Transport Layer (OSI Layer 4)
The transport layer is responsible for ensuring that application data is exchanged between two hosts. TCP provides a connection-oriented option, and UDP a connectionless one. At this layer there are several things that could prevent applications from working, so different commands come into play. Here are some of the common causes of Layer 4 network issues:
- Protocol settings on the source or destination host, including host firewalls that block inbound or outbound traffic; Windows, Mac, and Linux have a netstat command that reports all open TCP/IP socket connections; to troubleshoot host firewalls, each system has its own tooling (for instance, iptables is a common option on Linux).
- Network firewalls between the source and the destination host that block connection attempts; to check whether a firewall is blocking a service, you can use a command like telnet in the case of TCP, or the open source nmap, which offers several scan types (disclaimer: scanning Internet and third-party hosts without authorization can be prosecuted).
- Overlay networks are a common source of MTU mismatches, which cause some applications to behave inconsistently depending on the request or payload size. MTU can be tested with ping by setting the Don’t Fragment (DF) bit and sizing the packet to match the MTU that the application requires. Traceroute also offers an option to test the path MTU end-to-end.
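Two of the checks above lend themselves to a short sketch: a TCP connection test similar to what telnet is used for, and the arithmetic behind an MTU ping test. The function names and the state labels ("open", "closed", "filtered", "unreachable") are choices made for this example. The payload calculation assumes IPv4 without options (20-byte header) and an 8-byte ICMP header.

```python
import socket

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def tcp_port_state(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, nothing listening.
        return "closed"
    except socket.timeout:
        # No answer at all: a firewall may be silently dropping packets.
        return "filtered"
    except OSError:
        # Other errors, such as no route to host or a DNS failure.
        return "unreachable"


def icmp_payload_for_mtu(mtu):
    """Return the ping payload size that produces an IPv4 packet of
    exactly `mtu` bytes."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(icmp_payload_for_mtu(1500))  # 1472
```

With the payload size computed, the test can be run on Linux as `ping -M do -s 1472 <host>` or on Windows as `ping -f -l 1472 <host>`. If the ping fails with the DF bit set but succeeds with a smaller payload, some hop on the path has a lower MTU. Only run `tcp_port_state` against hosts you are authorized to test.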
Troubleshooting the Application Layer (OSI Layer 7)
Network troubleshooting typically concludes at the transport layer, but it remains crucial to incorporate fundamental application troubleshooting into the procedure. This step ensures that a given issue is not potentially triggered by application behavior. Furthermore, as certain problems may solely manifest within an application context, scrutinizing application logs and conducting tests through the application interface aids in pinpointing the underlying cause of a potential network problem. In cases of application outages where the source of the issue is unclear, whether it pertains to the network or the application itself, it is advisable to engage in collaborative troubleshooting efforts involving both network and application teams, rather than pursuing isolated approaches. This collaborative approach aims to expedite issue resolution by leveraging the expertise of both teams.
Network Troubleshooting Tools
The following is a list of common network troubleshooting tools:
- arp – A command that displays the association between MAC and IP addresses within a host or switch.
- ping – A TCP/IP utility used to verify reachability, latency, and packet loss to a remote host or destination.
- traceroute – A TCP/IP utility used to verify hop by hop information between a source and a destination host.
- nmap – An open source network scanner that lists the status of remote TCP ports.
- netstat – A command that lists all the open TCP/UDP sockets on a specific host.
- nslookup – A command that resolves a hostname and can be used to verify that a DNS server works.
- ipconfig (Win) or ifconfig (Linux/Mac) – A command that displays Layer 1, 2, and 3 interface information.
- tcpdump – A command used to run packet captures from the command line.
- Wireshark – A GUI-based packet capture utility.
Reduce Troubleshooting Time with NetBeez
NetBeez is an active network monitoring platform that enables operations and support teams to quickly troubleshoot performance issues from the user’s perspective. The solution relies on distributed network monitoring agents that provide end-to-end network and application performance metrics. NetBeez has three key pillars that make it a good solution for network troubleshooting.
Granular Performance Data
NetBeez captures granular network performance metrics for applications and services.
- Performance metrics at intervals as short as one second
- Helps isolate the exact moment when a problem occurs
- Retains historical data to generate baselines, identify trends and recurring issues, and perform root cause analysis
Proactive Incident Detection
NetBeez agents run real-time, end-to-end tests from the user’s perspective.
- Continuous active monitoring against networks and applications
- Quick detection and alerting on service failures and performance degradation
- Enforce and guarantee quality of service and SLAs
- Verify and validate configuration changes during maintenance windows
Multi-Platform Deployment
The solution supports flexible deployment options for on-premises, cloud, and remote environments.
- Deploy the server on-premises as a virtual appliance or in the cloud as an instance
- Support Ethernet, Wi-Fi, virtual, Docker, and Linux based agents
- Support Windows and Mac clients
- Easily orchestrate and deploy at scale
Conclusion on Network Troubleshooting
Network troubleshooting is a key aspect of network management that requires proper investment in tools and resources. Many organizations adopt a three-tier approach to handle ticket response and escalation. When troubleshooting network performance issues, it’s very important to keep in mind the OSI model and its layers. Starting from the bottom layers and working your way up ensures that the proper troubleshooting procedures are followed and leads to faster problem resolution.