Working With NetBeez Data Using Pandas

This tutorial will explore using the Python library called Pandas to analyze NetBeez data and help network engineers troubleshoot network performance issues. In this example we will use ping test data. However, the NetBeez API has additional data that can be used from agents, alerts, other tests, such as HTTP and iperf, and WiFi metrics.

To work through this tutorial a basic understanding of Python is needed. At minimum, readers should be able to run Python (version 3) code. In the next sections we will cover:

Introduction to Pandas
How to get started
Preparing the data
Graphing and visualizing the data

Introduction to Pandas

Pandas is a powerful and easy-to-use open-source data analysis and manipulation tool. Dataframes are a core component of Pandas and help to transform and visualize data. Think of dataframes as a class that stores data in a two-dimensional array.

Getting Started

To follow this tutorial, you can download sample ping data from two targets produced from a NetBeez agent. In alternative, you can use your own NetBeez data by using the NetBeez API (I am working on a separate blog post that explains how to query the NetBeez API to export data into a CSV file – subscribe to our blog if you want to be alerted when that’s available).

First, we will need to import our data from the CSVs into dataframes with Pandas, then assign them to variables. The following lines of code imports the CSV files located in the Downloads/Panda_Data/ folder within my macOS folder (~/Downloads/Panda_Data/)

google_target_data = pd.read_csv('~/Downloads/Pandas_Data/www.google.com_data.csv')
youtube_target_data = pd.read_csv('~/Downloads/Pandas_Data/youtube.com_data.csv')

agent_id	target_id	test_template_id	target_name	test_name	heir_type	interface_type	network_interface_id	data_id	error_code	ts	value
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102041813-66-21		1659102041813	65.5
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102046828-66-21	195	1659102046828
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102051841-66-21		1659102051841	65.8
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102056856-66-21		1659102056856	65.5
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102061870-66-21		1659102061870	66

Data Preparation

Describe is a great method to gain insight into numerical columns which returns statistics like count (total values counted), mean, standard deviation, minimum, maximum, and quartiles. To execute methods on a particular column, enter a column name string inside of square brackets after the dataframe name like youtube_target_data[‘value’].

describe = youtube_target_data['value'].describe()

	value
count	716
mean	60.43882682
standard_deviation	9.774659733
min	42.6
25%	47.725
50%	65.3
75%	66
max	118

The below methods are used to get these values individually:

count = youtube_target_data['value'].count()
mean = youtube_target_data['value'].mean()
median = youtube_target_data['value'].median()
min_value = youtube_target_data['value'].min()
max_value = youtube_target_data['value'].max()
standard_deviation = youtube_target_data['value'].std()

Dropping Data

Pandas provide a method to easily drop unnecessary columns from our dataframe. If the error column is not needed, it can be dropped with the below code:

youtube_target_data.drop(columns=['error_code'], inplace=True)

agent_id	target_id	test_template_id	target_name	test_name	heir_type	interface_type	network_interface_id	data_id	ts	value
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102041813-66-21	1659102041813	65.5
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102046828-66-21	1659102046828
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102051841-66-21	1659102051841	65.8
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102056856-66-21	1659102056856	65.5
66	5	21	youtube.com	YouTube	PingTest	wireless	109	1659102061870-66-21	1659102061870	66

Subsetting Data

If not all of the columns are necessary to analyze the data, a subset can be created with all rows and only a select few columns. In the case of the sample data, a new subset dataframe is created with only the columns: target_name, ts, and value.

youtube_target_data = youtube_target_data[['target_name', 'ts', 'value']]

target_name	ts	value
youtube.com	1659102041813	65.5
youtube.com	1659102046828
youtube.com	1659102051841	65.8
youtube.com	1659102056856	65.5
youtube.com	1659102061870	66

Sampling Data

There are methods for getting information on the first five or last five rows in a dataframe. The .head() displays the first five rows and the .tail() method displays the last five rows. If a specific number of rows are desired, an integer can be added inside of the parentheses. In the case of .head(10), the first ten rows are displayed.

youtube_target_data.head(3)

target_name	ts	value
youtube.com	1659102041813	65.5
youtube.com	1659102046828
youtube.com	1659102051841	65.8

Pandas provides a way to sample rows at random with the .sample() method. You will need to indicate how many rows you would like sampled inside of the parenthesis like below:

random_samples = youtube_target_data.sample(n=10)

target_name	ts	value
youtube.com	1659102955206	70.4
youtube.com	1659102809799	65.6
youtube.com	1659104163801	65.9
youtube.com	1659104931449	65
youtube.com	1659103782564	43.5
youtube.com	1659104765897	45.8
youtube.com	1659102061870	66
youtube.com	1659103386351	42.9
youtube.com	1659102940183	66.3
youtube.com	1659102217733	65.9

There are methods for filtering the largest and smallest rows by a particular column. To return three columns with the greatest value, .nlargest(3, ‘value’). Conversely, .nsmallest(3, ‘value’) will return the three rows with the smallest value in the ‘value’ column. The first input parameter is the number of rows to be returned and the second input parameter will be the column to filter by greatest and least.

youtube_target_data.nlargest(3, 'value')

target_name	ts	value
youtube.com	1659105177567	118
youtube.com	1659103230926	96.6
youtube.com	1659103245953	74.7

Missing Data

If missing data exists inside of the dataset, the below method can be used to check which columns have missing data and returns sums of n/a values in each column.

na_totals = youtube_target_data.isna().sum()

column_name	count
target_name	0
ts	0
value	1

One way to work with missing data is to drop the rows that have n/a values. For instance, if there are 717 rows in the dataframe and one row is missing data, using .dropna() will remove the one row of missing data leaving 716 rows.

youtube_target_data = youtube_target_data.dropna()

Another way to work with missing data is to fill in the missing values with a default value, mean, medium, or any value. The .fillna() method will need to be called on the column with the value inside of the parentheses. In the sample data, row two was missing a value in the ‘value’ column. After running the below command, the mean will replace any missing data.

youtube_target_data['value'] = youtube_target_data['value'].fillna(mean)

target_name	ts	value
youtube.com	1659102041813	65.5
youtube.com	1659102046828	60.4388
youtube.com	1659102051841	65.8

Visualizing Data

There are multiple ways to plot data with Python. To plot the sample data, the library Matplotlib may be used. This will help to give a visual insight into our data. First, the libraries will need to be imported:

import matplotlib.pyplot as plt
import datetime as datetime
import matplotlib.dates as mdates

Next, the column of timestamps will be converted to datetime values for both of our sample datasets:

youtube_target_data['ts'] = youtube_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))
google_target_data['ts'] = google_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))

The x-axis tick labels could be formatted with the below code to only display hours and minutes:

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

The sample data will need to be passed to our graph with the below code:

plt.plot(youtube_target_data['ts'],youtube_target_data['value'], color='g', label='Youtube', linewidth=0.75)
plt.plot(google_target_data['ts'],google_target_data['value'], color='b', label='Google', linewidth=0.75)

A label can be added to the y-axis with the below code. The values in the sample data are in milliseconds.

plt.ylabel('ms')

To extend the graph lines to the edge of the graph, the graph’s min and max will need to be adjusted:

plt.xlim([min(google_target_data['ts']), max(google_target_data['ts'])])

Finally, to show the graph with the two plotted lines and the graph’s legend, the below code will be used:

plt.legend()
plt.show()

Conclusion

Pandas is a powerful Python library that can help to analyze and transform data. This tutorial only scratched the surface of what Pandas is capable of. For more information on Pandas, the documentation is a great resource. More information on using NetBeez’s API can be found here.