Agent-based network and application monitoring has been gaining attention as enterprise networks and the services they support grow more complex. Today, a typical medium-to-large company has users in multiple geographical locations who rely on web applications that may be served from an on-premise data center, a private cloud, or a cloud-based SaaS provider.

SNMP limits

It’s no longer enough to monitor your WAN infrastructure with traditional up/down SNMP-based tools if you want to guarantee network and application SLAs. Since your users rely on such a wide spectrum of heterogeneous networks, what counts at the end of the day is what they experience at the end of the pipe. You should be able to detect, in a vendor- and network-agnostic fashion, when a problem comes up, and to tell how many users are affected and at which locations.

 

It’s not surprising that 83% of IT professionals surveyed say that the number one problem with application management is “determining whether problems are caused by the network, the system or the application.” This is where agent-based distributed network monitoring comes into play. In a previous blog post, I laid out the 5 most important properties of distributed network monitoring. In a nutshell, distributed network monitoring uses agents (hardware or software) deployed at each network location to simulate end users and to capture and report their experience.

 

At NetBeez, we have been working with network engineers who face exactly these types of challenges, and we have learned firsthand how they use distributed network monitoring to detect end-user issues faster, get more data to determine the root cause, and make their lives easier.

Here are the top 5 use cases of distributed network monitoring:

Use Case 1 – Monitoring Service Delivery

The goal of both the application and network groups is to provide users with highly available, high-quality applications. Consequently, it is critical for both teams to learn about a performance problem or an interruption as soon as possible so they can jump on it. In addition, it’s never pleasant to have to collect data about the problem by calling users, sending boots on the ground, or RDPing to remote workstations. Having all the necessary information pre-collected and readily available is of utmost importance.

Use Case 2 – Configuration Change Validation

Even the least critical configuration changes (routing, content filtering, firewalls, etc.) are done during off hours to minimize the impact in case something goes wrong. IT staff do their best to verify that their changes don’t cause any interruptions, but there is no shortage of campfire horror stories about changes that have cost people their jobs or cut off connectivity for 11 million people. Imagine if, during your configuration changes, you had an agent at each location set to report instantly if it lost connectivity to your data center or if content filtering stopped working. You could fix the problem on the spot and go to sleep worry free, without somebody waking you up at 7:30 am to fix whatever you broke last night, and without users waiting at their desks for the network to be repaired before they can start their day.
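To make the idea concrete, here is a minimal Python sketch of a before/after check that an agent could run around a change window. It is only an illustration under my own assumptions, not how NetBeez agents are implemented: the names `tcp_reachable`, `snapshot`, and `regressions`, and the example targets, are all hypothetical. A content filtering check could be plugged in the same way, with a probe that returns True while a site that should be blocked is still blocked.

```python
import socket
from typing import Callable, Dict, List, Tuple

Target = Tuple[str, int]  # (host, port); hypothetical targets for illustration


def tcp_reachable(target: Target, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


def snapshot(
    targets: Dict[str, Target],
    probe: Callable[[Target], bool] = tcp_reachable,
) -> Dict[str, bool]:
    """Run the probe against every named target and record pass/fail."""
    return {name: probe(target) for name, target in targets.items()}


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Names of checks that passed before the change but fail after it."""
    return sorted(
        name for name, ok in before.items() if ok and not after.get(name, False)
    )
```

In practice you would take one snapshot before the change, another right after it, and alert on anything `regressions` returns. Comparing against a baseline, rather than just checking what is down, avoids blaming the change for something that was already broken.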

Use Case 3 – Performance Measurement and Analysis

Apart from using the agents during troubleshooting and configuration changes, you can also collect data and build performance profiles. This way you know how your users experience the network and applications over a long period of time. Knowing which locations experience more outages every month, or which users suffer in terms of performance, helps you plan and budget for strategic IT upgrades.
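As a rough sketch of what such a profile could look like, the Python below groups test results by location and month and reports the number of failed tests and the median response time. The `Sample` record, its fields, and the `build_profile` function are hypothetical names chosen for this example; real agents would report richer data.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, NamedTuple, Optional, Tuple


class Sample(NamedTuple):
    """One test result reported by an agent (hypothetical record layout)."""
    location: str
    month: str                     # e.g. "2024-01"
    response_ms: Optional[float]   # None means the test failed


def build_profile(samples: List[Sample]) -> Dict[Tuple[str, str], dict]:
    """Summarize tests per (location, month): count, failures, median latency."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.location, s.month)].append(s.response_ms)

    profile = {}
    for key, values in grouped.items():
        succeeded = [v for v in values if v is not None]
        profile[key] = {
            "tests": len(values),
            "failures": len(values) - len(succeeded),
            # Median is undefined when every test in the group failed.
            "median_ms": median(succeeded) if succeeded else None,
        }
    return profile
```

Sorting the resulting entries by `failures` or `median_ms` gives a simple ranking of the locations that would benefit most from an upgrade.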

Use Case 4 – Local vs. Global

A typical help-desk procedure is to receive calls and escalate to the network or application group when there is a critical mass of tickets (usually 3-5) for a specific problem. The whole process is human driven and error prone, since users report their experience through a subjective filter (e.g. a slow application might be reported as “slow” or “not working” depending on the user’s frustration level). In contrast, data collected by the agents includes exact response times, when the problem started, when it stopped, whether it was intermittent, etc. For the help desk, it becomes very easy to see whether a problem is local or global and to escalate it without having to wait for actual users to pick up the phone and call. On top of that, the IT group can even bypass the help desk altogether, shortening the time to detection and the time to repair.
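The local-versus-global decision can be sketched in a few lines of Python. This is a simplified illustration: the `classify` function and the 50% cutoff are assumptions I picked for the example, and a real deployment would tune the threshold and probably weight locations by user count.

```python
from typing import Dict


def classify(status_by_location: Dict[str, bool], global_fraction: float = 0.5) -> str:
    """Classify a problem from per-location test results.

    status_by_location maps a location name to True (test passed) or
    False (test failed). global_fraction is an assumed cutoff: if at least
    this share of locations fail, the problem is treated as global.
    """
    failing = [loc for loc, ok in status_by_location.items() if not ok]
    if not failing:
        return "healthy"
    if len(failing) / len(status_by_location) >= global_fraction:
        return "global"
    return "local"
```

For example, `classify({"nyc": False, "sfo": True, "lon": True, "ber": True})` returns `"local"`, which points at the one failing site rather than at the application itself.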

Use Case 5 – It’s Not the Network, Stupid!

Let’s face it. The network gets blamed first for everything that’s wrong in this world! And quite often, the network group’s first reflex is to try to prove that it’s not the network in order to hand off the ticket to the application group. A straightforward way to do this is to run bandwidth tests from agents installed at the locations that report the problem. If the bandwidth is up to standard, you can start looking for root causes other than the network. It could be the application, the user’s workstation, or many other things, but… NOT THE NETWORK!

Bonus Use Case – Proactive Monitoring

It’s not uncommon for our customers to know about end-user problems before the users actually notice them or decide to report them. For example, an intermittent problem might stay under the radar for a while, with nobody reporting it until it becomes annoying enough or evolves into a more serious, disruptive issue. If an agent can detect and accurately report network and application problems in fine-grained detail, these kinds of situations can be avoided. The network and application groups can be proactive and fix problems even before users know about them.
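One simple way to surface an intermittent problem is to count how often a test flips between passing and failing within a window of recent results. The Python sketch below does exactly that; the function names and the default of three flips are assumptions made for illustration, not a recommended setting.

```python
from typing import Sequence


def count_flaps(results: Sequence[bool]) -> int:
    """Count transitions between pass (True) and fail (False) in order."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def is_intermittent(results: Sequence[bool], min_flaps: int = 3) -> bool:
    """Flag a test as intermittent if it changed state at least min_flaps times.

    min_flaps is an assumed threshold; a single outage that starts and
    then recovers produces only two transitions and is not flagged.
    """
    return count_flaps(results) >= min_flaps
```

A window such as `[True, False, True, False, True]` contains four transitions and would be flagged, even though the test is passing right now and no user may have called yet.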

Conclusion

There are many more use cases we have collected (e.g. VPN tunnel drop, Wi-Fi issues, jitter, etc.), but the 6 listed above are the most common and important ones. We believe that up/down monitoring is no longer adequate for satisfying the needs of the network and application groups in the era of cloud and SDN.

If you have any similar use cases that are related to distributed network monitoring, I would like to hear from you.