This tutorial will explore using the Python library called Pandas to analyze NetBeez data and help network engineers troubleshoot network performance issues. In this example we will use ping test data. However, the NetBeez API has additional data that can be used from agents, alerts, other tests, such as HTTP and iperf, and WiFi metrics.
To work through this tutorial a basic understanding of Python is needed. At minimum, readers should be able to run Python (version 3) code. In the next sections we will cover:
- Introduction to Pandas
- How to get started
- Preparing the data
- Graphing and visualizing the data
Introduction to Pandas
Pandas is a powerful and easy-to-use open-source data analysis and manipulation tool. Dataframes are a core component of Pandas and help to transform and visualize data. Think of dataframes as a class that stores data in a two-dimensional array.
Getting Started
To follow this tutorial, you can download sample ping data from two targets produced from a NetBeez agent. In alternative, you can use your own NetBeez data by using the NetBeez API (I am working on a separate blog post that explains how to query the NetBeez API to export data into a CSV file – subscribe to our blog if you want to be alerted when that’s available).
First, we will need to import our data from the CSVs into dataframes with Pandas, then assign them to variables. The following lines of code imports the CSV files located in the Downloads/Panda_Data/ folder within my macOS folder (~/Downloads/Panda_Data/)
google_target_data = pd.read_csv('~/Downloads/Pandas_Data/www.google.com_data.csv')
youtube_target_data = pd.read_csv('~/Downloads/Pandas_Data/youtube.com_data.csv')
agent_id | target_id | test_template_id | target_name | test_name | heir_type | interface_type | network_interface_id | data_id | error_code | ts | value |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102041813-66-21 | 1659102041813 | 65.5 | |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102046828-66-21 | 195 | 1659102046828 | |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102051841-66-21 | 1659102051841 | 65.8 | |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102056856-66-21 | 1659102056856 | 65.5 | |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102061870-66-21 | 1659102061870 | 66 |
Data Preparation
Describe is a great method to gain insight into numerical columns which returns statistics like count (total values counted), mean, standard deviation, minimum, maximum, and quartiles. To execute methods on a particular column, enter a column name string inside of square brackets after the dataframe name like youtube_target_data[‘value’].
describe = youtube_target_data['value'].describe()
value | |
count | 716 |
mean | 60.43882682 |
standard_deviation | 9.774659733 |
min | 42.6 |
25% | 47.725 |
50% | 65.3 |
75% | 66 |
max | 118 |
The below methods are used to get these values individually:
count = youtube_target_data['value'].count()
mean = youtube_target_data['value'].mean()
median = youtube_target_data['value'].median()
min_value = youtube_target_data['value'].min()
max_value = youtube_target_data['value'].max()
standard_deviation = youtube_target_data['value'].std()
Dropping Data
Pandas provide a method to easily drop unnecessary columns from our dataframe. If the error column is not needed, it can be dropped with the below code:
youtube_target_data.drop(columns=['error_code'], inplace=True)
agent_id | target_id | test_template_id | target_name | test_name | heir_type | interface_type | network_interface_id | data_id | ts | value |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102041813-66-21 | 1659102041813 | 65.5 |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102046828-66-21 | 1659102046828 | |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102051841-66-21 | 1659102051841 | 65.8 |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102056856-66-21 | 1659102056856 | 65.5 |
66 | 5 | 21 | youtube.com | YouTube | PingTest | wireless | 109 | 1659102061870-66-21 | 1659102061870 | 66 |
Subsetting Data
If not all of the columns are necessary to analyze the data, a subset can be created with all rows and only a select few columns. In the case of the sample data, a new subset dataframe is created with only the columns: target_name, ts, and value.
youtube_target_data = youtube_target_data[['target_name', 'ts', 'value']]
target_name | ts | value |
youtube.com | 1659102041813 | 65.5 |
youtube.com | 1659102046828 | |
youtube.com | 1659102051841 | 65.8 |
youtube.com | 1659102056856 | 65.5 |
youtube.com | 1659102061870 | 66 |
Sampling Data
There are methods for getting information on the first five or last five rows in a dataframe. The .head() displays the first five rows and the .tail() method displays the last five rows. If a specific number of rows are desired, an integer can be added inside of the parentheses. In the case of .head(10), the first ten rows are displayed.
youtube_target_data.head(3)
target_name | ts | value |
youtube.com | 1659102041813 | 65.5 |
youtube.com | 1659102046828 | |
youtube.com | 1659102051841 | 65.8 |
Pandas provides a way to sample rows at random with the .sample() method. You will need to indicate how many rows you would like sampled inside of the parenthesis like below:
random_samples = youtube_target_data.sample(n=10)
target_name | ts | value |
youtube.com | 1659102955206 | 70.4 |
youtube.com | 1659102809799 | 65.6 |
youtube.com | 1659104163801 | 65.9 |
youtube.com | 1659104931449 | 65 |
youtube.com | 1659103782564 | 43.5 |
youtube.com | 1659104765897 | 45.8 |
youtube.com | 1659102061870 | 66 |
youtube.com | 1659103386351 | 42.9 |
youtube.com | 1659102940183 | 66.3 |
youtube.com | 1659102217733 | 65.9 |
There are methods for filtering the largest and smallest rows by a particular column. To return three columns with the greatest value, .nlargest(3, ‘value’). Conversely, .nsmallest(3, ‘value’) will return the three rows with the smallest value in the ‘value’ column. The first input parameter is the number of rows to be returned and the second input parameter will be the column to filter by greatest and least.
youtube_target_data.nlargest(3, 'value')
target_name | ts | value |
youtube.com | 1659105177567 | 118 |
youtube.com | 1659103230926 | 96.6 |
youtube.com | 1659103245953 | 74.7 |
Missing Data
If missing data exists inside of the dataset, the below method can be used to check which columns have missing data and returns sums of n/a values in each column.
na_totals = youtube_target_data.isna().sum()
column_name | count |
target_name | 0 |
ts | 0 |
value | 1 |
One way to work with missing data is to drop the rows that have n/a values. For instance, if there are 717 rows in the dataframe and one row is missing data, using .dropna() will remove the one row of missing data leaving 716 rows.
youtube_target_data = youtube_target_data.dropna()
Another way to work with missing data is to fill in the missing values with a default value, mean, medium, or any value. The .fillna() method will need to be called on the column with the value inside of the parentheses. In the sample data, row two was missing a value in the ‘value’ column. After running the below command, the mean will replace any missing data.
youtube_target_data['value'] = youtube_target_data['value'].fillna(mean)
target_name | ts | value |
youtube.com | 1659102041813 | 65.5 |
youtube.com | 1659102046828 | 60.4388 |
youtube.com | 1659102051841 | 65.8 |
Visualizing Data
There are multiple ways to plot data with Python. To plot the sample data, the library Matplotlib may be used. This will help to give a visual insight into our data. First, the libraries will need to be imported:
import matplotlib.pyplot as plt
import datetime as datetime
import matplotlib.dates as mdates
Next, the column of timestamps will be converted to datetime values for both of our sample datasets:
youtube_target_data['ts'] = youtube_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))
google_target_data['ts'] = google_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))
The x-axis tick labels could be formatted with the below code to only display hours and minutes:
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
The sample data will need to be passed to our graph with the below code:
plt.plot(youtube_target_data['ts'],youtube_target_data['value'], color='g', label='Youtube', linewidth=0.75)
plt.plot(google_target_data['ts'],google_target_data['value'], color='b', label='Google', linewidth=0.75)
A label can be added to the y-axis with the below code. The values in the sample data are in milliseconds.
plt.ylabel('ms')
To extend the graph lines to the edge of the graph, the graph’s min and max will need to be adjusted:
plt.xlim([min(google_target_data['ts']), max(google_target_data['ts'])])
Finally, to show the graph with the two plotted lines and the graph’s legend, the below code will be used:
plt.legend()
plt.show()
Conclusion
Pandas is a powerful Python library that can help to analyze and transform data. This tutorial only scratched the surface of what Pandas is capable of. For more information on Pandas, the documentation is a great resource. More information on using NetBeez’s API can be found here.