Working With NetBeez Data Using Pandas

This tutorial will explore using the Python library called Pandas to analyze NetBeez data and help network engineers troubleshoot network performance issues. In this example we will use ping test data. However, the NetBeez API has additional data that can be used from agents, alerts, other tests, such as HTTP and iperf, and WiFi metrics.

To work through this tutorial a basic understanding of Python is needed. At minimum, readers should be able to run Python (version 3) code. In the next sections we will cover:

  • Introduction to Pandas
  • How to get started
  • Preparing the data
  • Graphing and visualizing the data

Introduction to Pandas

Pandas is a powerful and easy-to-use open-source data analysis and manipulation tool. Dataframes are a core component of Pandas and help to transform and visualize data. Think of dataframes as a class that stores data in a two-dimensional array.

Getting Started

To follow this tutorial, you can download sample ping data from two targets produced from a NetBeez agent. In alternative, you can use your own NetBeez data by using the NetBeez API (I am working on a separate blog post that explains how to query the NetBeez API to export data into a CSV file – subscribe to our blog if you want to be alerted when that’s available).

First, we will need to import our data from the CSVs into dataframes with Pandas, then assign them to variables. The following lines of code imports the CSV files located in the Downloads/Panda_Data/ folder within my macOS folder (~/Downloads/Panda_Data/)

google_target_data = pd.read_csv('~/Downloads/Pandas_Data/www.google.com_data.csv')
youtube_target_data = pd.read_csv('~/Downloads/Pandas_Data/youtube.com_data.csv')
agent_idtarget_idtest_template_idtarget_nametest_nameheir_typeinterface_typenetwork_interface_iddata_iderror_codetsvalue
66521youtube.comYouTubePingTestwireless1091659102041813-66-21165910204181365.5
66521youtube.comYouTubePingTestwireless1091659102046828-66-211951659102046828
66521youtube.comYouTubePingTestwireless1091659102051841-66-21165910205184165.8
66521youtube.comYouTubePingTestwireless1091659102056856-66-21165910205685665.5
66521youtube.comYouTubePingTestwireless1091659102061870-66-21165910206187066

Data Preparation

Describe is a great method to gain insight into numerical columns which returns statistics like count (total values counted), mean, standard deviation, minimum, maximum, and quartiles. To execute methods on a particular column, enter a column name string inside of square brackets after the dataframe name like youtube_target_data[‘value’].

describe = youtube_target_data['value'].describe()
value
count716
mean60.43882682
standard_deviation9.774659733
min42.6
25%47.725
50%65.3
75%66
max118

The below methods are used to get these values individually:

count = youtube_target_data['value'].count()
mean = youtube_target_data['value'].mean()
median = youtube_target_data['value'].median()
min_value = youtube_target_data['value'].min()
max_value = youtube_target_data['value'].max()
standard_deviation = youtube_target_data['value'].std()

Dropping Data

Pandas provide a method to easily drop unnecessary columns from our dataframe. If the error column is not needed, it can be dropped with the below code:

youtube_target_data.drop(columns=['error_code'], inplace=True)
agent_idtarget_idtest_template_idtarget_nametest_nameheir_typeinterface_typenetwork_interface_iddata_idtsvalue
66521youtube.comYouTubePingTestwireless1091659102041813-66-21165910204181365.5
66521youtube.comYouTubePingTestwireless1091659102046828-66-211659102046828
66521youtube.comYouTubePingTestwireless1091659102051841-66-21165910205184165.8
66521youtube.comYouTubePingTestwireless1091659102056856-66-21165910205685665.5
66521youtube.comYouTubePingTestwireless1091659102061870-66-21165910206187066

Subsetting Data

If not all of the columns are necessary to analyze the data, a subset can be created with all rows and only a select few columns. In the case of the sample data, a new subset dataframe is created with only the columns: target_name, ts, and value.

youtube_target_data = youtube_target_data[['target_name', 'ts', 'value']]
target_nametsvalue
youtube.com165910204181365.5
youtube.com1659102046828
youtube.com165910205184165.8
youtube.com165910205685665.5
youtube.com165910206187066

Sampling Data

There are methods for getting information on the first five or last five rows in a dataframe. The .head() displays the first five rows and the .tail() method displays the last five rows. If a specific number of rows are desired, an integer can be added inside of the parentheses. In the case of .head(10), the first ten rows are displayed.

youtube_target_data.head(3)
target_nametsvalue
youtube.com165910204181365.5
youtube.com1659102046828
youtube.com165910205184165.8

Pandas provides a way to sample rows at random with the .sample() method. You will need to indicate how many rows you would like sampled inside of the parenthesis like below:

random_samples = youtube_target_data.sample(n=10)
target_nametsvalue
youtube.com165910295520670.4
youtube.com165910280979965.6
youtube.com165910416380165.9
youtube.com165910493144965
youtube.com165910378256443.5
youtube.com165910476589745.8
youtube.com165910206187066
youtube.com165910338635142.9
youtube.com165910294018366.3
youtube.com165910221773365.9

There are methods for filtering the largest and smallest rows by a particular column. To return three columns with the greatest value, .nlargest(3, ‘value’). Conversely, .nsmallest(3, ‘value’) will return the three rows with the smallest value in the ‘value’ column. The first input parameter is the number of rows to be returned and the second input parameter will be the column to filter by greatest and least.

youtube_target_data.nlargest(3, 'value')
target_nametsvalue
youtube.com1659105177567118
youtube.com165910323092696.6
youtube.com165910324595374.7

Missing Data

If missing data exists inside of the dataset, the below method can be used to check which columns have missing data and returns sums of n/a values in each column.

na_totals = youtube_target_data.isna().sum()
column_namecount
target_name0
ts0
value1

One way to work with missing data is to drop the rows that have n/a values. For instance, if there are 717 rows in the dataframe and one row is missing data, using .dropna() will remove the one row of missing data leaving 716 rows.

youtube_target_data = youtube_target_data.dropna()

Another way to work with missing data is to fill in the missing values with a default value, mean, medium, or any value. The .fillna() method will need to be called on the column with the value inside of the parentheses. In the sample data, row two was missing a value in the ‘value’ column. After running the below command, the mean will replace any missing data.

youtube_target_data['value'] = youtube_target_data['value'].fillna(mean)
target_nametsvalue
youtube.com165910204181365.5
youtube.com165910204682860.4388
youtube.com165910205184165.8

Visualizing Data

There are multiple ways to plot data with Python. To plot the sample data, the library Matplotlib may be used. This will help to give a visual insight into our data. First, the libraries will need to be imported:

import matplotlib.pyplot as plt
import datetime as datetime
import matplotlib.dates as mdates

Next, the column of timestamps will be converted to datetime values for both of our sample datasets:

youtube_target_data['ts'] = youtube_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))
google_target_data['ts'] = google_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))

The x-axis tick labels could be formatted with the below code to only display hours and minutes:

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

The sample data will need to be passed to our graph with the below code:

plt.plot(youtube_target_data['ts'],youtube_target_data['value'], color='g', label='Youtube', linewidth=0.75)
plt.plot(google_target_data['ts'],google_target_data['value'], color='b', label='Google', linewidth=0.75)

A label can be added to the y-axis with the below code. The values in the sample data are in milliseconds.

plt.ylabel('ms')

To extend the graph lines to the edge of the graph, the graph’s min and max will need to be adjusted:

plt.xlim([min(google_target_data['ts']), max(google_target_data['ts'])])

Finally, to show the graph with the two plotted lines and the graph’s legend, the below code will be used:

plt.legend()
plt.show()

Conclusion

Pandas is a powerful Python library that can help to analyze and transform data. This tutorial only scratched the surface of what Pandas is capable of. For more information on Pandas, the documentation is a great resource. More information on using NetBeez’s API can be found here.

decoration image

Get your free trial now

Monitor your network from the user perspective

You can share

Twitter Linkedin Facebook

Let's keep in touch

decoration image