Avi Chawla

Summary

The web content provides an overview of the 20% of Pandas functions that are most frequently used by data scientists for 80% of their tasks, applying Pareto's Principle to the Pandas library in Python.

Abstract

The article titled "20% of Pandas Functions that Data Scientists Use 80% of the Time" applies Pareto's Principle to the Pandas library, emphasizing the importance of mastering a core set of functions for efficient data manipulation. It introduces readers to essential operations such as reading and saving CSV files, merging and concatenating DataFrames, sorting and filtering data, and handling missing values. The author, through practical code snippets and references to the official Pandas documentation, aims to equip data scientists with the necessary skills to perform common data analysis tasks. The post concludes by encouraging readers to practice these functions on a dummy DataFrame and to consult the Pandas official documentation for in-depth understanding.

Opinions

  • The author believes that mastering a subset of Pandas functions is sufficient for most data science tasks, adhering to the 80-20 rule.
  • Practical examples and code snippets are considered effective tools for learning and applying Pandas functions.
  • The article suggests that hands-on experience, such as working in a Jupyter notebook, is crucial for solidifying one's understanding of Pandas.
  • The author holds the Pandas official documentation in high regard, recommending it as a primary resource for learning and reference.

20% of Pandas Functions that Data Scientists Use 80% of the Time

Putting Pareto’s Principle to work on the Pandas library

Photo by Austin Distel on Unsplash

Mastering an entire Python library like Pandas can be challenging for anyone. However, if we take a step back and think, do we really need to know every minute detail of a library, especially when we live in a world governed by Pareto’s Principle? For those who don’t know, Pareto’s Principle (also known as the 80–20 rule) says that roughly 20% of your inputs tend to generate about 80% of your outputs.

Therefore, this post is my attempt to apply Pareto’s Principle to the Pandas library and introduce you to the 20% of Pandas functions you are likely to use 80% of the time when working with DataFrames. The methods below are the ones I find myself using repeatedly in my day-to-day work, and I feel they are both necessary and sufficient for anyone getting started with Pandas.

1/n: Reading a CSV file:

If you want to read a CSV file in Pandas, use the pd.read_csv() method as demonstrated below:

Code snippet for reading a CSV file (Image by author created using snappify.io)
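A minimal sketch of this, assuming a small made-up CSV with `name` and `age` columns. To keep the example self-contained, the text is wrapped in `io.StringIO` as a stand-in for a file on disk; in practice you would pass a path to `pd.read_csv()`.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file such as "data.csv".
csv_text = "name,age\nAlice,30\nBob,25\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```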

Read the documentation here.

2/n: Saving a DataFrame to a CSV file:

If you want to save a DataFrame to a CSV file, use the to_csv() method as demonstrated below:

Code snippet for saving DataFrame to a CSV file (Image by author created using snappify.io)
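A minimal sketch of this, with a made-up DataFrame and a hypothetical file name (`people.csv`). It writes into a temporary directory so that running it leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.csv")  # hypothetical file name
    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)
    with open(path) as f:
        written = f.read()

print(written)
```

Without `index=False`, the row index is written as an extra first column, which is rarely what you want when the index is just the default 0, 1, 2, and so on.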

Read the documentation here.

3/n: Creating a DataFrame from a list of lists:

If you want to create a DataFrame from a list of lists, use the pd.DataFrame() constructor as demonstrated below:

Code snippet for creating a DataFrame from a list of lists (Image by author created using snappify.io)
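A minimal sketch of this, with made-up rows and column names. Each inner list becomes one row, and the `columns` argument supplies the headers.

```python
import pandas as pd

# Each inner list is one row of the DataFrame.
rows = [["Alice", 30], ["Bob", 25]]

df = pd.DataFrame(rows, columns=["name", "age"])
print(df)
```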

Read the documentation here.

4/n: Creating a DataFrame from a dictionary:

If you want to create a DataFrame from a dictionary, use the pd.DataFrame() constructor as demonstrated below:

Code snippet for creating a DataFrame from a dictionary (Image by author created using snappify.io)
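A minimal sketch of this, with made-up data. Each dictionary key becomes a column name and each value (a list of equal length) becomes that column's data.

```python
import pandas as pd

# Keys become column names; values become the column contents.
data = {"name": ["Alice", "Bob"], "age": [30, 25]}

df = pd.DataFrame(data)
print(df)
```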

Read the documentation here.

5/n: Merging DataFrames:

A merge operation on DataFrames is analogous to a JOIN in SQL: it joins two DataFrames on one or more columns. If you want to merge two DataFrames, use the pd.merge() function as demonstrated below:

Code snippet for merging DataFrames (Image by author created using snappify.io)
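A minimal sketch of this, using two made-up DataFrames that share an `id` column. The `how` argument selects the join type (`"inner"`, `"left"`, `"right"`, or `"outer"`); an inner join is assumed here.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Cara"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# Inner join: keep only the ids present in both DataFrames.
merged = pd.merge(left, right, on="id", how="inner")
print(merged)
```

Only ids 2 and 3 appear in both inputs, so the result has two rows.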

Read the documentation here.

6/n: Sorting a DataFrame:

If you want to sort a DataFrame based on the values in a particular column, use the sort_values() method as demonstrated below:

Code snippet for sorting a DataFrame (Image by author created using snappify.io)
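A minimal sketch of this, sorting a made-up DataFrame by its `age` column in descending order. `sort_values()` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Cara"], "age": [30, 25, 35]})

# ascending=False puts the largest values first.
sorted_df = df.sort_values("age", ascending=False)
print(sorted_df)
```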

Read the documentation here.

7/n: Concatenating DataFrames:

If you want to concatenate DataFrames, use the pd.concat() method as demonstrated below:

Code snippet for concatenating DataFrames (Image by author created using snappify.io)

Read the documentation here.

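  • axis = 1 places the DataFrames side by side, aligning rows on the index.
  • axis = 0 stacks rows on top of each other; columns are aligned by header, and any header missing from one DataFrame is filled with NaN.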
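The two axis options can be sketched as follows, using made-up DataFrames. `ignore_index=True` is an optional extra that renumbers the rows of the stacked result.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
df3 = pd.DataFrame({"c": [9, 10]})

# axis=0: stack rows; df1 and df2 share the same column headers.
stacked_rows = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1: place columns side by side, aligned on the row index.
side_by_side = pd.concat([df1, df3], axis=1)

print(stacked_rows)
print(side_by_side)
```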

8/n: Rename Column(s):

If you want to rename one or more columns in a DataFrame, use the rename() method as demonstrated below:

Code snippet for renaming columns in a DataFrame (Image by author created using snappify.io)
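A minimal sketch of this, with made-up column names. The `columns` argument takes a mapping from old names to new names, and `rename()` returns a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Map old column names to new ones; unlisted columns stay as they are.
renamed = df.rename(columns={"a": "alpha", "b": "beta"})
print(renamed)
```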

Read the documentation here.

9/n: Add New Column:

If you want to add a new column to a DataFrame, you can use the usual assignment operation as demonstrated below:

Code snippet for adding a new column to a DataFrame (Image by author created using snappify.io)
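A minimal sketch of this, with made-up `price` and `qty` columns. The new `total` column is computed element-wise from the two existing ones.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 2]})

# Assigning to a new key creates the column.
df["total"] = df["price"] * df["qty"]
print(df)
```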

10/n: Filter DataFrame based on condition:

If you want to filter rows from a DataFrame based on a condition, you can do so as shown below:

Code snippet for filtering a DataFrame (Image by author created using snappify.io)
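A minimal sketch of this, using boolean indexing on a made-up DataFrame. The comparison produces a boolean Series, and indexing with it keeps only the rows where it is True. Multiple conditions are combined with `&` or `|`, with each condition wrapped in parentheses.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Cara"], "age": [30, 25, 35]})

# Keep rows where age is greater than 28.
older = df[df["age"] > 28]

# Combine conditions with & (and) or | (or); parentheses are required.
older_not_cara = df[(df["age"] > 28) & (df["name"] != "Cara")]

print(older)
print(older_not_cara)
```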

11/n: Drop Column(s):

If you want to drop one or more columns from a DataFrame, use the drop() method as demonstrated below:

Code snippet for dropping columns from a DataFrame (Image by author created using snappify.io)
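A minimal sketch of this, with made-up column names. Passing a list to the `columns` argument drops several columns at once, and `drop()` returns a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Drop columns "b" and "c"; the original df is left unchanged.
trimmed = df.drop(columns=["b", "c"])
print(trimmed)
```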

Read the documentation here.

12/n: GroupBy:

If you want to perform an aggregation operation after grouping, use the groupby() method as demonstrated below:

Code snippet for grouping a DataFrame (Image by author created using snappify.io)
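A minimal sketch of this, grouping a made-up DataFrame by `team` and summing `points` within each group. Other aggregations such as `mean()`, `count()`, or `max()` follow the same pattern.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "y", "x", "y"],
    "points": [1, 2, 3, 4],
})

# Group rows by team, then sum the points within each group.
totals = df.groupby("team")["points"].sum()
print(totals)
```

The result is a Series indexed by the group key, so `totals["x"]` gives the total for team `x`.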

Read the documentation here.

13/n: Unique Values in a column:

If you want to count or print the unique values in a column of a DataFrame, use the nunique() or unique() method, respectively, as demonstrated below:

Code snippet for finding unique values in a DataFrame column (Image by author created using snappify.io)
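A minimal sketch of this, with a made-up `city` column. `unique()` returns the distinct values in order of first appearance, while `nunique()` returns how many there are.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Rome", "Paris", "Oslo"]})

distinct = df["city"].unique()   # the distinct values themselves
count = df["city"].nunique()     # the number of distinct values

print(distinct)
print(count)
```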

Read the documentation here.

14/n: Fill NaN values:

If you want to replace NaN values in a column with some other value, use the fillna() method as demonstrated below:

Code snippet for filling NaN values in a DataFrame (Image by author created using snappify.io)
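A minimal sketch of this, replacing the missing value in a made-up `score` column with 0. The fill value is an arbitrary choice for illustration; a column mean or median is also common.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

# fillna returns a new Series, so assign it back to the column.
df["score"] = df["score"].fillna(0)
print(df)
```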

Read the documentation here.

15/n: Apply Function on a column:

If you want to apply a function to a column, use the apply() method as demonstrated below:

Code snippet for applying a function on a DataFrame (Image by author created using snappify.io)
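A minimal sketch of this, applying a hypothetical helper `to_fahrenheit` to every value in a made-up `celsius` column. The function is called once per element, and the results form a new Series.

```python
import pandas as pd

df = pd.DataFrame({"celsius": [0.0, 100.0]})


def to_fahrenheit(celsius):
    """Hypothetical helper: convert one Celsius value to Fahrenheit."""
    return celsius * 9 / 5 + 32


# apply calls the function on each value in the column.
df["fahrenheit"] = df["celsius"].apply(to_fahrenheit)
print(df)
```

For simple arithmetic like this, the vectorized form `df["celsius"] * 9 / 5 + 32` gives the same result and is usually faster; `apply()` is most useful when the logic cannot be expressed with built-in operations.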

Read the documentation here.

16/n: Remove Duplicates:

If you want to remove duplicate rows from a DataFrame, use the drop_duplicates() method as demonstrated below:

Code snippet for removing duplicates from a DataFrame (Image by author created using snappify.io)
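A minimal sketch of this, with a made-up DataFrame whose third row repeats the first. By default, `drop_duplicates()` compares all columns and keeps the first occurrence of each row.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice"],
    "age": [30, 25, 30],
})

# Rows that match an earlier row in every column are removed.
deduped = df.drop_duplicates()
print(deduped)
```

To compare only certain columns, pass them via the `subset` argument, for example `df.drop_duplicates(subset=["name"])`.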

Read the documentation here.

17/n: Value Counts:

If you want to find the frequency of each value in a column, use the value_counts() method as demonstrated below:

Code snippet for counting the frequency of values in a column (Image by author created using snappify.io)
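A minimal sketch of this, with a made-up `city` column. `value_counts()` returns a Series indexed by the distinct values, sorted so that the most frequent value comes first.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Rome", "Paris", "Oslo", "Paris"]})

# Count how many times each value appears in the column.
counts = df["city"].value_counts()
print(counts)
```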

18/n: Size of a DataFrame:

If you want to find the size of a DataFrame, use the .shape attribute, which returns a (rows, columns) tuple.
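A minimal sketch of reading `.shape`, with a made-up DataFrame of three rows and two columns. Because `.shape` is an attribute rather than a method, it is accessed without parentheses.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```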

To conclude, in this post I covered some of the most commonly used functions and methods in Pandas to help you get started with the library. Though this post should make you comfortable with the syntax, I would highly recommend creating a dummy DataFrame of your own and experimenting with it in a Jupyter notebook.

Further, there is no better place than the official Pandas documentation, available here, to acquire fundamental and practical knowledge of the various methods in Pandas. The official documentation provides a detailed explanation of each argument accepted by a function along with a practical example, which, in my opinion, is an excellent way to acquire Pandas expertise.

Thanks for reading. I hope this post was helpful.
