Top 10 Categories of Pandas Functions That I Use Most

Get familiar with these functions to help you process data

People love to use Python because it has a versatile repository of third-party libraries for all kinds of work. For data science, one of the most popular libraries for data processing is Pandas. Over the years, because of its open-source nature, many developers have contributed to this project, making pandas powerful for almost any data processing job.

I didn’t count, but I felt like there were hundreds of functions that you can use with Pandas. Although I use maybe twenty or thirty functions frequently, it’s unrealistic to talk about them all. Thus, I’ll just focus on the 10 most useful categories of functions in this post. Once you get along with them well, they can probably address over 70% of your data processing needs.

1. Reading data

We usually read data from external sources. Depending on the format of the source data, we can use the corresponding read_* functions.

read_csv: use it when your source data is in the CSV format. Some notable arguments include header (whether and which row is the header), sep (the delimiter), and usecols (a subset of columns to use).
read_excel: use it when your source data is in Excel format. Some notable arguments include sheet_name (which sheet) and header.
read_pickle: use it when your source data is a pickled DataFrame. Pickling is a good mechanism to store DataFrame, typically better than CSV and Excel.
read_sas: I use this function frequently because I used to use SAS to process data.

2. Writing data

When you’re done processing your data, you may want to save your DataFrame to a file for long-term storage or data exchange with your co-workers.

to_csv: writes to a CSV file. It doesn't preserve some data types, such as dates. The size tends to be bigger than others. I typically set the argument index as False, because I don’t need an extra column to show the index in the data file.
to_excel: writes to an Excel file.
to_pickle: writes to a pickle file. As I just mentioned, I use pickled files such that the data types can be properly preserved when I read them.

3. Data summary/overview

After you read your data to a DataFrame, it’s a good idea to get some descriptives for the dataset.

head: check the first several rows to see if the data are read properly.
tail: check the last several rows. This is equally important. When you deal with a large file, chances are that the reading can be incomplete. By checking the tail, you’ll find out if the reading has been complete.
info: have an overall summary of the dataset. Some useful information includes the data types for the columns and the memory usage.
describe: provide a descriptive summary of the dataset.
shape: the number of rows and columns (it’s an attribute, not a function).

4. Sorting data

I often sort the data after I’ve done most of the other processing steps. Particularly, if I’m going to write the DataFrame to an external file, such as Excel, I almost always sort the data before the export. It’s because sorted data are easier for others to locate the needed information using eyeballs.

sort_values: sort the data by specifying the column names. Because I’m mostly working with files in which rows are observations, the sorting is done by columns.

5. Dealing with duplicates

When we work with real-life datasets, chances are that there are duplicates. For example, some data are entered accidentally twice into the data source. It’s important to remove the duplicates.

duplicated: identify whether there are duplicates in the DataFrame. You can specify what columns are used to identify duplicates.
drop_duplicates: remove duplicates from the DataFrame. I don’t use this function blindly. For cautiousness, I always use the duplicated function to check the duplicates first.

6. Dealing with missing values

It’s almost unavoidable that there are missing values in your datasets. It’s a good practice you inspect the missingness of your dataset and decide how to do with the missing values.

isnull: check the missingness of your DataFrame.
dropna: drops the observations with missing data. Notable arguments include how (how an observation is determined to be dropped or not) and thred (the number of missing values to be eligible for dropping).
fillna: fills the missing values by the specified approach, such as fill forward (ffill).

7. Extracting new data

Columns can contain multiple pieces of information. For example, our dataset may have data like proj-0001, in which the first four letters are the project’s acronym while the last four digits are the unique ID for the subjects. To extract these data, I often use the following functions.

map: create a column by using information from a single column. In other words, you call this function on a Series object, like df[“sub_id”] = df[“temp_id”].map(lambda x: int(x[-4:])).
apply: create one or multiple columns by using data from multiple columns. You often need to specify axis=1 when you’re creating columns.

8. Transforming data

There are typically two kinds of data. One is the “wide” format, which refers to that each row represents a single subject or observation and columns include repeated measures for the subject. The other is the “long” format. In this format, a subject has multiple rows, and each row may represent a measure of a certain timepoint. Often, you may need to convert data between these two formats.

melt: convert a wide dataset to a long dataset. Notable arguments include id_vars (for the identifiers) and value_vars (the list of columns whose values contribute to a value column).
pivot: convert a long dataset to a wide dataset. Notable arguments include index (the unique identifiers), columns (the columns becoming value columns), and values (the columns having values).

9. Merging datasets

When you have separate data sources, you may want to merge them such that you have a combined dataset.

merge: merge the current one with the other one. You specify one or multiple columns as the identifier for merging (the on argument, or left_on & right_on). Other notable arguments include how (such as inner or left, or outer), and suffixes (what suffixes are used for the two datasets).
concat: concatenate DataFrame objects along rows or columns. It’s useful when you have multiple DataFrame objects of the same shape/store the same information.

10. Summaries by groups

Our datasets typically include categorical variables to indicate the data’s characteristics, such as schools for students, projects for subjects, and the class levels for tickets.

groupby: create a GroupBy object, you can specify one or more columns.
mean: you can call mean on the GroupBy object, to find out the means. You can do the same thing for other stats, such as std.
size: frequencies for the groups
agg: a customizable aggregating function. In this function, you can request statistics done for the specified column(s).

Conclusions

In this post, I reviewed the top 10 categories of functions that I use often in my daily data processing jobs. Although the reviews are brief, they provide a guideline for you to organize your learning of Pandas.

I hope you find this article useful. Thanks for reading.