Pandas Basics — 1. Data Structures, Indexing/Slicing, Missing Values Handling

I’ve already talked about NumPy and Matplotlib, which can be considered prerequisites to data science in Python.
The last remaining prerequisite is the Pandas library, which is why I’m writing this short series about it. A series on data science in Python will follow soon.
But that’s for later; for now, let’s focus on Pandas.
What is Pandas?
Pandas is a Python library that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
How does Pandas Help for Data Science?
Pandas is particularly useful for data wrangling, which is the process of cleaning, formatting, and preparing data for analysis. It helps you to quickly convert raw data into a clean and organized form, so you can start analyzing it. Pandas also provides functionality for performing sophisticated data analysis, such as time series analysis, statistical modeling, and machine learning.
Some of the main benefits of using Pandas include:
- Easy handling of missing data (represented as NaN values in the data)
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can indicate that the object should be aligned along one of its axes
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
How to Install Pandas?
To install Pandas, you can use pip, the Python package manager. Open a terminal window and type:
pip install pandas
Pandas Data Structures
Pandas provides two main data structures: the Series and the DataFrame.
The Series: A Pandas Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a series in a chart. A Series is created by passing a list of values to the pd.Series constructor, along with an optional index. For example:
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
Output:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
This creates a Series with six elements, containing numbers and a NaN (Not a Number) value; because of the NaN, the values are stored as floats (dtype float64). The Series is assigned a default integer index, starting from 0 and going up to the number of elements minus 1. We can also specify our own index labels when creating the Series, as shown below.
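For instance, here is a minimal sketch with explicit string labels (the labels 'a' through 'e' are just an illustrative choice):
import pandas as pd

# One label per value; values can then be accessed by label
s = pd.Series([1, 3, 5, 6, 8], index=['a', 'b', 'c', 'd', 'e'])
print(s['c'])  # 5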
The DataFrame: a DataFrame is a 2-dimensional size-mutable, tabular data structure with rows and columns. It is similar to a spreadsheet, a SQL table, or a dictionary of Series objects. You can think of a DataFrame as a collection of Series that share the same index.
A DataFrame can be created from a variety of data sources, such as:
- A NumPy array
- A list or dictionary of Series objects
- A 2-dimensional NumPy array
- A list of dictionaries
- A dictionary of dictionaries
- etc.
Here is an example of creating a DataFrame:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
        'age': [25, 30, 35, 40, 45],
        'income': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)
print(df)
Output:
      name  age  income
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3     Dave   40   80000
4      Eve   45   90000
This creates a DataFrame with five rows and three columns, containing the data for five individuals. The DataFrame has a default integer index, starting from 0 and going up to the number of rows minus 1. We can also specify our own index labels when creating the DataFrame, as in the sketch below.
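As a quick sketch of an alternative construction (the records and the row labels 'r1' to 'r3' are invented for illustration), a DataFrame can also be built from a list of dictionaries with explicit index labels:
import pandas as pd

# Each dictionary becomes one row; its keys become the column names
records = [{'name': 'Alice', 'age': 25},
           {'name': 'Bob', 'age': 30},
           {'name': 'Charlie', 'age': 35}]
df2 = pd.DataFrame(records, index=['r1', 'r2', 'r3'])
print(df2)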
Both Series and DataFrames have a number of methods and attributes that can be used to manipulate and analyze the data they contain. For example, we can use the head() and tail() methods to view the first and last few rows of a DataFrame, respectively. We can use the describe() method to get summary statistics of the numeric columns in a DataFrame. We can also use indexing and slicing to select specific rows or columns from a DataFrame.
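Continuing with the df built above, a quick sketch of these methods:
print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows
print(df.describe())  # count, mean, std, min, quartiles and max for the numeric columns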
Indexing and Selection
There are many techniques to select and index data with Pandas. These techniques include label-based indexing, integer-based indexing, and boolean indexing.
Label-based indexing refers to indexing data using the labels of rows and columns rather than their integer position. This is done using the .loc indexer. For example, to select a specific row in a DataFrame, we can use the following syntax:
df.loc[row_label]
Where row_label is the label of the row you want to select. Similarly, to select a specific column in a DataFrame, we can use the following syntax:
df.loc[:, col_label]
Where col_label is the label of the column you want to select. You can also select a specific subset of rows and columns by specifying both the row labels and column labels in the indexer, like so:
df.loc[row_labels, col_labels]
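For a concrete sketch with the df defined earlier (its default integer index means the row labels are 0 to 4):
print(df.loc[0])                        # the row labeled 0 (Alice)
print(df.loc[:, 'name'])                # the 'name' column
print(df.loc[[0, 2], ['name', 'age']])  # rows 0 and 2, columns 'name' and 'age'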
Integer-based indexing refers to indexing data using the integer position of rows and columns rather than their labels. This is done using the .iloc indexer. The syntax for selecting rows and columns using integer-based indexing is similar to label-based indexing, except that you use integer positions instead of labels. For example:
df.iloc[row_index]
df.iloc[:, col_index]
df.iloc[row_indices, col_indices]
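Again with the df from above, a minimal sketch:
print(df.iloc[0])         # first row by position
print(df.iloc[:, 1])      # second column ('age') by position
print(df.iloc[0:2, 0:2])  # first two rows and first two columns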
Boolean indexing refers to indexing data using a boolean array. This is useful for selecting rows that meet certain criteria. For example, to select all rows where the value of a specific column is greater than some threshold, we can use the following syntax:
df[df[col] > threshold]
This will return a new DataFrame containing only the rows where the value of col is greater than threshold.
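For example, with the df defined earlier, selecting everyone older than 30 could look like this:
older = df[df['age'] > 30]  # the boolean mask keeps only the matching rows
print(older)                # Charlie, Dave and Eve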
Handling Missing Data
In a real-world data analysis scenario, it is not uncommon to encounter missing or null values in your dataset. These missing values can cause problems when trying to perform certain operations on the data, and can also be misleading if not handled properly.
One way to handle missing data in Pandas is to simply drop rows or columns that contain null values. This can be done using the .dropna() method, which has several optional parameters that allow you to specify which rows or columns to drop and how to handle missing values.
For example, to drop rows that contain any null values, you can use the following syntax:
df.dropna()
To drop only rows where all values are null, you can set the how parameter to 'all':
df.dropna(how='all')
To drop columns that contain any null values, you can set the axis parameter to 1:
df.dropna(axis=1)
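Here is a minimal, self-contained sketch of these three calls; the small df_missing DataFrame is invented just to show the behavior:
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                           'b': [np.nan, np.nan, 6.0]})
print(df_missing.dropna())           # keeps only the last row, the only one with no NaN
print(df_missing.dropna(how='all'))  # drops only the middle row, which is entirely NaN
print(df_missing.dropna(axis=1))     # drops both columns, since each contains at least one NaN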
Another way to handle missing data is to fill in the missing values with a placeholder value. This can be done using the .fillna() method, which takes the placeholder value as an argument. For example, to fill missing values with 0:
df.fillna(0)
You can also use the inplace parameter to fill missing values in place, rather than returning a new DataFrame with the missing values filled in. For example:
df.fillna(0, inplace=True)
This will fill any missing values in the DataFrame with 0 and modify the original DataFrame in place.
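Beyond a constant placeholder, a common pattern (sketched here with the df_missing DataFrame from the previous example) is to fill each column with its own mean:
filled = df_missing.fillna(df_missing.mean())  # per-column means replace the NaNs
print(filled)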
It’s important to note that simply dropping rows or columns with missing data or filling in missing values with a placeholder may not always be the best solution, as it can introduce bias or alter the overall statistical properties of the data. It’s a good idea to carefully consider the implications of these actions before implementing them.
In addition to the .dropna() and .fillna() methods, Pandas also provides several other tools for handling missing data, such as .isnull(), which returns a boolean mask indicating which values are missing, and .notnull(), which returns the opposite of .isnull(). These methods can be useful for identifying and manipulating missing data in your DataFrame.
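As a quick sketch, again with df_missing:
print(df_missing.isnull())        # True where a value is missing
print(df_missing.isnull().sum())  # number of missing values per column
print(df_missing.notnull())       # True where a value is present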
Final Note
Pandas is an essential tool for data science and data analysis in Python. Its powerful data manipulation and analysis capabilities, combined with its user-friendly interface and intuitive syntax, make it a go-to tool for working with tabular data.
Whether you’re cleaning and wrangling data, performing statistical analysis, or building machine learning models, Pandas is a valuable tool to have in your toolkit. By learning the basics of Pandas, you can greatly enhance your ability to perform data-driven tasks in your work as a data scientist or data analyst.
In the next article, I’ll explain how to manipulate data using Pandas, so be sure to follow me if you don’t want to miss this article!