Esteban Thilliez

Summary

The provided web content is an introductory guide to Pandas, a Python library essential for data science, covering its installation, data structures, indexing/slicing, and handling of missing data.

Abstract

The article introduces Pandas as a pivotal Python library for data analysis, emphasizing its utility in data wrangling and manipulation. It explains the library's role in facilitating the handling of labeled data and outlines its key benefits, such as easy management of missing data, size mutability, and powerful data analysis tools. The guide details the two primary data structures in Pandas—Series and DataFrame—and provides examples of their creation and usage. It also delves into various indexing techniques, including label-based, integer-based, and boolean indexing, and discusses methods for handling missing data through dropping or filling techniques. The author concludes by underscoring the importance of Pandas in data science and invites readers to follow for more in-depth articles on the subject.

Opinions

  • The author considers Pandas a fundamental tool for real-world data analysis in Python.
  • Pandas is praised for its expressive data structures and ease of use, particularly in cleaning and organizing raw data.
  • The article suggests that the ability to handle missing data effectively is a significant advantage of using Pandas.
  • The author implies that the versatility of Pandas' data structures, such as the Series and DataFrame, is crucial for various data manipulation tasks.
  • The guide promotes the use of Pandas for a range of data analysis tasks, including time series analysis, statistical modeling, and machine learning.
  • The author encourages the audience to engage with their content by following, clapping, or subscribing, indicating a belief in the value and educational impact of their work.

Pandas Basics — 1. Data Structures, Indexing/Slicing, Missing Values Handling

I’ve already talked about NumPy and Matplotlib, which can be considered prerequisites to data science in Python.

The last remaining prerequisite is the Pandas library. That’s why I’m writing a small series about it before starting a series about data science in Python.

But that’s for later. Let’s focus on Pandas first.

What is Pandas?

Pandas is a Python library that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

How Does Pandas Help with Data Science?

Pandas is particularly useful for data wrangling, which is the process of cleaning, formatting, and preparing data for analysis. It helps you quickly convert raw data into a clean and organized form, so you can start analyzing it. Pandas also includes time series functionality and is commonly used to prepare data for statistical modeling and machine learning libraries.

Some of the main benefits of using Pandas include:

  • Easy handling of missing data (represented as NaN values in the data)
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can indicate that the object should be aligned along one of its axes
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
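To make the split-apply-combine idea from the list above a bit more concrete, here is a minimal sketch. The sales table and its column names are made up for illustration; only groupby() and sum() come from Pandas itself.

```python
import pandas as pd

# Hypothetical data: one row per sale
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 200, 50],
})

# Split the rows by region, apply a sum to each group,
# and combine the results into a Series indexed by region
totals = sales.groupby("region")["amount"].sum()
print(totals)
```

The result is indexed by the group key, so totals["north"] gives the total for that region.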

How to Install Pandas?

To install Pandas, you can use pip, the Python package manager. Open a terminal window and type:

pip install pandas
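To check that the installation worked, one simple option is to import the library in a Python session and print its version. The variable name below is just for illustration.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string
# tells you which release you got
installed_version = pd.__version__
print(installed_version)
```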

Pandas Data Structures

Pandas provides two main data structures: the Series and the DataFrame.

The Series: A Pandas Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a series in a chart. A Series is created by passing a list of values to the pd.Series constructor, along with an optional index. For example:

import numpy as np  # needed for np.nan
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

This creates a Series with six elements. Because one of the values is NaN (Not a Number), Pandas stores the whole Series as floats, which is why the integers are displayed as 1.0, 3.0, and so on. The Series is assigned an index by default, starting from 0 and going up to the number of elements minus 1. We can also specify our own index labels when creating the Series.
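As a small sketch of the custom index mentioned above, the same values can be given string labels through the index argument. The labels here are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, one per value
labels = ["a", "b", "c", "d", "e", "f"]
s_labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=labels)

# Values can now be looked up by label instead of by position
print(s_labeled["c"])

# The dtype is still float64 because of the NaN value
print(s_labeled.dtype)
```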

The DataFrame: a DataFrame is a 2-dimensional size-mutable, tabular data structure with rows and columns. It is similar to a spreadsheet, a SQL table, or a dictionary of Series objects. You can think of a DataFrame as a collection of Series that share the same index.

A DataFrame can be created from a variety of data sources, such as:

  • A dictionary of lists or 1-dimensional NumPy arrays
  • A list or dictionary of Series objects
  • A 2-dimensional NumPy array
  • A list of dictionaries
  • A dictionary of dictionaries
  • etc.

Here is an example of creating a DataFrame:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
        'age': [25, 30, 35, 40, 45],
        'income': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)
print(df)

Output:

      name  age  income
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3     Dave   40   80000
4      Eve   45   90000

This creates a DataFrame with five rows and three columns, containing the data for five individuals. The DataFrame has a default index, starting from 0 and going up to the number of rows minus 1. We can also specify our own index labels when creating the DataFrame.

Both Series and DataFrames have a number of methods and attributes that can be used to manipulate and analyze the data they contain. For example, we can use the head() and tail() methods to view the first and last few rows of a DataFrame, respectively. We can use the describe() method to get summary statistics of the numeric columns in a DataFrame. We can also use indexing and slicing to select specific rows or columns from a DataFrame.
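Here is a short sketch of the methods just mentioned, using the same five-person data as the DataFrame example. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
    "age": [25, 30, 35, 40, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
})

first_two = df.head(2)   # first 2 rows (the default is 5)
last_two = df.tail(2)    # last 2 rows
stats = df.describe()    # count, mean, std, min, quartiles, max

print(first_two)
print(last_two)
print(stats)
```

Note that describe() only summarizes the numeric columns here, so the name column does not appear in its result.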

Indexing and Selection

There are many techniques to select and index data with Pandas. These techniques include label-based indexing, integer-based indexing, and boolean indexing.

Label-based indexing refers to indexing data using the labels of rows and columns rather than their integer position. This is done using the .loc indexer. For example, to select a specific row in a DataFrame, we can use the following syntax:

df.loc[row_label]

Where row_label is the label of the row you want to select. Similarly, to select a specific column in a DataFrame, we can use the following syntax:

df.loc[:, col_label]

Where col_label is the label of the column you want to select. You can also select a specific subset of rows and columns by specifying both the row labels and column labels in the indexer, like so:

df.loc[row_labels, col_labels]
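The three forms above can be sketched on a small DataFrame. The row labels and values below are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35], "income": [50000, 60000, 70000]},
    index=["alice", "bob", "charlie"],  # hypothetical row labels
)

row = df.loc["bob"]          # one row, returned as a Series
col = df.loc[:, "age"]       # one column, returned as a Series
subset = df.loc[["alice", "charlie"], ["income"]]  # rows and columns

# Unlike regular Python slices, label slices include both endpoints
span = df.loc["alice":"bob"]

print(row)
print(col)
print(subset)
print(span)
```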

Integer-based indexing refers to indexing data using the integer position of rows and columns rather than their labels. This is done using the .iloc indexer. The syntax for selecting rows and columns using integer-based indexing is similar to label-based indexing, except that you use integer indices instead of labels. For example:

df.iloc[row_index]
df.iloc[:, col_index]
df.iloc[row_indices, col_indices]
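The same kind of sketch works for .iloc. The data is again invented, and the row labels are kept on purpose to show that .iloc ignores them.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35], "income": [50000, 60000, 70000]},
    index=["alice", "bob", "charlie"],  # labels are ignored by .iloc
)

row0 = df.iloc[0]        # first row, returned as a Series
col1 = df.iloc[:, 1]     # second column (income)

# Position slices behave like Python slices: the end is excluded
block = df.iloc[0:2, 0:2]

print(row0)
print(col1)
print(block)
```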

Boolean indexing refers to indexing data using a boolean array. This is useful for selecting rows that meet certain criteria. For example, to select all rows where the value of a specific column is greater than some threshold, we can use the following syntax:

df[df[col] > threshold]

This will return a new DataFrame containing only the rows where the value of col is greater than threshold.
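As a concrete sketch, here is boolean indexing on the five-person data used earlier. The thresholds are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
    "age": [25, 30, 35, 40, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
})

# The comparison produces a boolean Series, one value per row
mask = df["age"] > 30
older = df[mask]

# Conditions are combined with & (and) or | (or);
# each condition needs its own parentheses
both = df[(df["age"] > 30) & (df["income"] < 90000)]

print(older)
print(both)
```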

Handling Missing Data

In a real-world data analysis scenario, it is not uncommon to encounter missing or null values in your dataset. These missing values can cause problems when trying to perform certain operations on the data, and can also be misleading if not handled properly.

One way to handle missing data in Pandas is to simply drop rows or columns that contain null values. This can be done using the .dropna() method, which has optional parameters such as axis, how, thresh, and subset that control what gets dropped.

For example, to drop rows that contain any null values, you can use the following syntax:

df.dropna()

To drop only the rows in which every value is null, you can set the how parameter to 'all':

df.dropna(how='all')

To drop columns that contain any null values, you can set the axis parameter to 1:

df.dropna(axis=1)
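To see how the three calls differ, here is a small sketch on a made-up DataFrame where row 1 is partially missing and row 2 is entirely missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, np.nan],
})

# Drop every row that has at least one missing value
any_dropped = df.dropna()

# Drop only the rows where all values are missing
all_dropped = df.dropna(how="all")

# Drop columns with at least one missing value; applied here after
# removing the empty row, otherwise every column would be dropped
cols_dropped = all_dropped.dropna(axis=1)

print(any_dropped)
print(all_dropped)
print(cols_dropped)
```

None of these calls modify df; each one returns a new DataFrame.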

Another way to handle missing data is to fill in the missing values with a placeholder value. This can be done using the .fillna() method, which takes the placeholder value as an argument. For example, to fill missing values with 0:

df.fillna(0)

You can also use the inplace parameter to fill missing values in place, rather than returning a new DataFrame with the missing values filled in. For example:

df.fillna(0, inplace=True)

This will fill any missing values in the DataFrame with 0 and modify the original DataFrame in place.
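Here is a minimal sketch of filling values on invented data. Filling with the column means is shown as one possible alternative to a constant, not as a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "income": [50000.0, 60000.0, np.nan],
})

# Replace every missing value with the same constant
filled_zero = df.fillna(0)

# df.mean() returns one mean per column; passing that Series to
# fillna fills each column with its own mean
filled_mean = df.fillna(df.mean())

print(filled_zero)
print(filled_mean)
```

Both calls return new DataFrames and leave df unchanged, so the original missing values are still available if you want to try another strategy.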

It’s important to note that simply dropping rows or columns with missing data or filling in missing values with a placeholder may not always be the best solution, as it can introduce bias or alter the overall statistical properties of the data. It’s a good idea to carefully consider the implications of these actions before implementing them.

In addition to the .dropna() and .fillna() functions, Pandas also provides several other functions for handling missing data, such as .isnull(), which returns a boolean mask indicating which values are missing, and .notnull(), which returns the opposite of .isnull(). These functions can be useful for identifying and manipulating missing data in your DataFrame.
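A short sketch of how these masks can be used, again on invented data: counting missing values per column and keeping only the complete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "income": [50000.0, 60000.0, np.nan],
})

# Boolean DataFrame: True where a value is missing
missing_mask = df.isnull()

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = missing_mask.sum()

# Keep only the rows where every value is present
complete_rows = df[df.notnull().all(axis=1)]

print(missing_mask)
print(missing_per_column)
print(complete_rows)
```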

Final Note

Pandas is an essential tool for data science and data analysis in Python. Its powerful data manipulation and analysis capabilities, combined with its user-friendly interface and intuitive syntax, make it a go-to tool for working with tabular data.

Whether you’re cleaning and wrangling data, performing statistical analysis, or building machine learning models, Pandas is a valuable tool to have in your toolkit. By learning the basics of Pandas, you can greatly enhance your ability to perform data-driven tasks in your work as a data scientist or data analyst.

In the next article, I’ll explain how to manipulate data using Pandas, so be sure to follow me if you don’t want to miss this article!

