NetNORAD: Active network monitoring at Facebook

Active monitoring of data centers is nothing new to NetBeez. In a previous post, I showed how leaf-and-spine data centers based on Cumulus Linux can be monitored with NetBeez. I also wrote about how Microsoft is monitoring its data centers with an active probing system called Pingmesh. You may not be surprised to know that Facebook is doing something similar with an in-house system called NetNORAD, which enables the infrastructure team to quickly troubleshoot its massive data center networks via end-to-end probing.

Goals of NetNORAD

Human investigation of a network interruption can take several minutes, if not hours. The short-term goal of NetNORAD is to detect network outages automatically, within seconds. The ultimate goal is more ambitious: to repair outages automatically, also within seconds, and so minimize service interruption.

The infrastructure team at Facebook realized that a traditional SNMP poller may take several minutes to detect a problem with a network device and trigger an automated remediation response. In some cases, the device cannot even detect and report its own malfunction, which delays detection further. To solve this problem, NetNORAD treats the network as a black box: rather than polling the network hardware for information, it troubleshoots problems by sending packets across the network.

Facebook network design

Hierarchically, the network design at Facebook is pretty straightforward. Servers are installed in racks, and racks are grouped into clusters. A data center is composed of one or more clusters that share the same network in the same building. One or more data centers make up a region, which is located in a specific geographical area. Regions are then interconnected by a high-speed backbone.

Pingers and responders

NetNORAD has two main components: the pinger, which sends UDP packets, and the responder, which receives and sends back packets with a timestamp. These components run on the servers as processes. Each pinger sends packets with different DSCP values, waits for the response from the responders, and cycles again. From a deployment perspective, each cluster runs a small number of pingers, while each server in each data center runs a responder process.
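To make the pinger and responder roles concrete, here is a minimal Python sketch of the exchange described above, running over the loopback interface. It is not Facebook's implementation: the packet layout (a 4-byte sequence number, with the responder appending an 8-byte receive timestamp), the function names, and the DSCP values in the demo are all assumptions made for illustration.

```python
import socket
import struct
import threading
import time


def run_responder(sock: socket.socket, stop: threading.Event) -> None:
    """Echo each probe back with the responder's receive timestamp appended."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(64)
        except socket.timeout:
            continue
        # Hypothetical layout: original payload + receive time as a double.
        sock.sendto(data + struct.pack("!d", time.time()), addr)


def ping_once(target, dscp: int, seq: int, timeout: float = 0.5):
    """Send one UDP probe marked with `dscp`; return the RTT in seconds, or None on loss."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # DSCP occupies the upper six bits of the IP TOS byte.
        s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
        s.settimeout(timeout)
        sent = time.monotonic()
        s.sendto(struct.pack("!I", seq), target)
        try:
            data, _ = s.recvfrom(64)
        except socket.timeout:
            return None  # no reply within the timeout counts as a lost probe
        rtt = time.monotonic() - sent
        reply_seq = struct.unpack("!I", data[:4])[0]
        return rtt if reply_seq == seq else None


def demo():
    """Start a responder on localhost and probe it once per DSCP value."""
    responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    responder.bind(("127.0.0.1", 0))  # let the OS pick a free port
    stop = threading.Event()
    worker = threading.Thread(target=run_responder, args=(responder, stop), daemon=True)
    worker.start()
    try:
        target = responder.getsockname()
        # DSCP values chosen only as examples (best effort, AF11, EF).
        return [ping_once(target, dscp, seq) for seq, dscp in enumerate((0, 10, 46))]
    finally:
        stop.set()
        worker.join()
        responder.close()
```

Calling demo() returns one round-trip time per DSCP value. In a real deployment the pinger and responder would of course run on different servers, and a lost probe (None) would feed the packet loss statistics.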

In NetBeez, a monitoring agent is a standalone device that acts as both pinger and responder.

Proximity tagging

All pingers share the same global list of target servers, which includes at least two servers per rack. This means that each individual target (server) is probed by all pingers. From a pinger's perspective, results are aggregated and tagged with DC if the targets reside in the same data center, REGION if they reside in the same region, and GLOBAL if they reside in other regions. This technique is called proximity tagging.
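The tagging rule above fits in a few lines of Python. The Location type and its field names are hypothetical; only the three tag names (DC, REGION, GLOBAL) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical placement record for a pinger or a target server."""
    region: str
    datacenter: str


def proximity_tag(pinger: Location, target: Location) -> str:
    """Tag a result by how far the target is from the pinger."""
    if pinger.region != target.region:
        return "GLOBAL"  # target is in another region
    if pinger.datacenter != target.datacenter:
        return "REGION"  # same region, different data center
    return "DC"          # same data center


pinger = Location(region="us-east", datacenter="dc1")
print(proximity_tag(pinger, Location("us-east", "dc1")))  # DC
print(proximity_tag(pinger, Location("us-east", "dc2")))  # REGION
print(proximity_tag(pinger, Location("eu-west", "dc7")))  # GLOBAL
```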

Packet loss and network latency measurement

In contrast to Pingmesh, which primarily uses TCP-based pings, NetNORAD uses UDP packets to probe the network and measure packet loss and network latency, two key metrics that affect TCP performance. So why UDP and not TCP? Mostly to preserve server performance. UDP and TCP have the same forwarding behavior in the network. However, TCP requires more resources on both the sender (pinger) and receiver (responder) sides, which skews measurements and confuses the application monitoring systems.

For each responder, the system calculates a percentile of the measurements over a ten-minute window. This metric is very important in detecting performance degradation, such as an increase in packet loss on a specific link. The typical time to detect network performance issues is around 20 to 30 seconds.
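As a rough illustration of a ten-minute percentile, the sketch below keeps a sliding window of loss samples for one responder and reports a nearest-rank percentile. The class name, the choice of per-round loss ratio as the sample, and the nearest-rank method are assumptions for the example, not details taken from NetNORAD.

```python
import math
from collections import deque

WINDOW_SECONDS = 600  # ten minutes


class LossWindow:
    """Sliding window of (timestamp, loss_ratio) samples for one responder."""

    def __init__(self, window: float = WINDOW_SECONDS):
        self.window = window
        self.samples = deque()

    def add(self, ts: float, loss_ratio: float) -> None:
        """Record a sample and evict anything older than the window."""
        self.samples.append((ts, loss_ratio))
        while self.samples and self.samples[0][0] <= ts - self.window:
            self.samples.popleft()

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile (0 < p <= 100) of the loss ratios in the window."""
        values = sorted(v for _, v in self.samples)
        if not values:
            raise ValueError("no samples in window")
        rank = max(1, math.ceil(p * len(values) / 100))
        return values[rank - 1]


w = LossWindow()
for minute in range(9):
    w.add(minute * 60, 0.0)   # nine clean probing rounds
w.add(540, 0.5)               # one round with 50% loss
print(w.percentile(50), w.percentile(100))
```

A high percentile reacts to a single bad round, while the median stays flat until loss becomes sustained, which is why tracking more than one percentile can help separate blips from real degradation.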

Fault isolation

Proximity tagging is used to determine whether a problem affects a specific cluster, a data center, or an entire region. The concept is very similar to what we call distributed network monitoring, which enables engineers to distinguish global from local failures.

For example, if packet loss is reported by only one cluster in a specific data center, then the failure is located within that cluster. If, on the other hand, all clusters within a data center, or all the data centers within a region, report a failure, then the failure is located at a higher level, within the data center or the backbone.
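The reasoning in this example can be sketched as a small function that reports the highest level at which every child reports loss. The input format, a mapping from (region, data center, cluster) to a boolean, and the function name are hypothetical; the real system works on tagged time series rather than a single snapshot.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def localize(lossy: Dict[Tuple[str, str, str], bool]) -> List[str]:
    """Map per-cluster loss reports to the most likely failure scope.

    `lossy` maps (region, data center, cluster) to True when that
    cluster's pingers report packet loss.
    """
    regions = defaultdict(lambda: defaultdict(dict))
    for (region, dc, cluster), bad in lossy.items():
        regions[region][dc][cluster] = bad

    findings = []
    for region, dcs in sorted(regions.items()):
        # Every cluster in every data center is lossy: look above the data centers.
        if all(all(clusters.values()) for clusters in dcs.values()):
            findings.append(f"region {region}")
            continue
        for dc, clusters in sorted(dcs.items()):
            if all(clusters.values()):
                findings.append(f"data center {dc}")
            else:
                findings.extend(
                    f"cluster {dc}/{name}"
                    for name, bad in sorted(clusters.items())
                    if bad
                )
    return findings


reports = {
    ("r1", "dc1", "c1"): True,
    ("r1", "dc1", "c2"): False,
    ("r1", "dc2", "c1"): False,
}
print(localize(reports))  # ['cluster dc1/c1']
```

Note that a data center with a single cluster is ambiguous under this rule: the sketch reports it at the data center level.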

Proximity tagging enables quick detection of the problem location but not the source of the problem. For this reason, Facebook has developed a traceroute utility called fbtracert.

Fbtracert

Fbtracert is very similar to MTR, a traceroute utility that also calculates the packet loss at each hop. By running fbtracert tests from multiple pingers, NetNORAD can explore multiple network paths between two endpoints, correlate the test results, and determine the failure point. However, like traceroute, this tool is not perfect. For example, fbtracert is not effective when there are frequent routing changes, or when the same router, due to control-plane policies, returns probe packets with a different source IP.
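To show what correlating per-hop loss across paths might look like, here is a simplified Python sketch. It is not fbtracert's actual algorithm: the heuristic (blame the first hop from which loss persists all the way to the end of the path, then count votes across paths), the 5% threshold, and all names are assumptions for illustration. It also assumes every hop was sent at least one probe.

```python
from collections import Counter
from typing import List, Optional, Tuple

Hop = Tuple[str, int, int]  # (hop address, probes sent, replies received)


def first_persistent_loss(path: List[Hop], threshold: float = 0.05) -> Optional[str]:
    """Return the first hop from which loss persists to the end of the path."""
    suspect = None
    for addr, sent, received in path:
        loss = 1 - received / sent
        if loss > threshold:
            if suspect is None:
                suspect = addr
        else:
            # Loss vanished downstream, so the earlier hop was probably
            # just rate-limiting its replies rather than dropping traffic.
            suspect = None
    return suspect


def vote(paths: List[List[Hop]]) -> Counter:
    """Count how many paths point at each suspect hop."""
    suspects = (first_persistent_loss(p) for p in paths)
    return Counter(s for s in suspects if s is not None)


paths = [
    [("a", 10, 10), ("b", 10, 7), ("c", 10, 7)],   # loss starts at b and persists
    [("a", 10, 10), ("d", 10, 5), ("c", 10, 10)],  # d drops replies only
    [("a", 10, 10), ("b", 10, 6), ("e", 10, 6)],   # loss starts at b and persists
]
print(vote(paths).most_common(1))  # [('b', 2)]
```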

Conclusion

It’s nice to see that a state-of-the-art network like Facebook’s relies on an active monitoring solution to reduce the time to detect and repair service outages. NetBeez, NetNORAD, and Pingmesh show how active network monitoring makes a difference in supporting large and complex network environments.