The author shares their experience with using the Pandas library in Python for data analysis and provides a list of 30 methods they found most useful.
Abstract
The article is a guide for beginners in data analysis using the Pandas library in Python. The author reflects on their 3+ years of experience using Pandas and shares the 30 methods they have found most useful. These methods include reading and storing CSV files, creating and modifying DataFrames, printing descriptive information about the DataFrame, handling missing data, joining DataFrames, sorting and grouping DataFrames, filtering DataFrames, finding unique values, applying functions to DataFrames, handling duplicates, finding the distribution of values, resetting the index of a DataFrame, finding cross-tabulation, and pivoting DataFrames. The author provides code examples for each method and encourages readers to try them out.
Opinions
The author believes that mastering these 30 methods will enable beginners to perform 95% of the tasks they will encounter when working with Pandas.
The author emphasizes the importance of handling missing data and provides a link to a previous blog post on the topic.
The author encourages readers to try out the code examples and provides a link to subscribe to their newsletter for more tips and tricks on data science.
The author mentions that the study is backed by their own experience as well as working with fellow data scientists and seeing their work.
The author thanks the readers for their time and hopes the post was helpful.
The Only 30 Methods You Should Master To Become A Pandas Pro
After using pandas for over three years, here are the 30 methods I have used almost all the time
Pandas is undoubtedly one of the best libraries ever built in Python for tabular data-wrangling and processing tasks.
Because it is open-source, developers from around the world have contributed to it and brought it to where it is today, with hundreds of methods supporting a wide range of tasks.
However, if you are a newcomer trying to get a firm grip on the Pandas library, things can look daunting and overwhelming at first if you start with the official Pandas documentation.
The list of topics is shown below:
List of topics in the official Pandas API documentation (Image by Author)
Having been there myself, I wrote this blog to help you get started with Pandas.
In other words, in this blog, I will reflect on my 3+ years of experience using Pandas and share those 30 specific methods that I have used almost all the time.
In label-based selection, every label asked for must be in the index of the DataFrame.
Integers are valid labels too, but they refer to the label and not the position.
Consider the following DataFrame.
Use the df.loc[] indexer for label-based selection.
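As a minimal sketch of label-based selection, assume a small hypothetical DataFrame whose index holds the integer labels 10, 20 and 30 (the column names and values are made up for illustration). Note that `df.loc[20]` looks up the label 20, not the row at position 20:

```python
import pandas as pd

# Hypothetical data; the index labels are integers that differ from positions.
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": ["A", "B", "C"]},
    index=[10, 20, 30],
)

# Select the row whose index label is 20 (returned as a Series).
row_20 = df.loc[20]

# Select the rows labelled 10 and 30, and only the column labelled "col2".
subset = df.loc[[10, 30], "col2"]

print(row_20)
print(subset)
```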
However, in df.loc[], you are not allowed to use position to filter the DataFrame, as shown below:
To achieve the above, you should use position-based selection using df.iloc[].
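The difference can be sketched as follows, again assuming a hypothetical DataFrame indexed by the labels 10, 20 and 30. Because 0 is not one of the labels, `df.loc[0]` raises a `KeyError`, while `df.iloc[0]` returns the first row by position:

```python
import pandas as pd

# Hypothetical data; no row carries the label 0.
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": ["A", "B", "C"]},
    index=[10, 20, 30],
)

# Label-based lookup: fails because 0 is a position here, not a label.
try:
    df.loc[0]
except KeyError as err:
    loc_error = err
    print("df.loc[0] raised KeyError:", err)

# Position-based lookup: returns the first row regardless of its label.
first_row = df.iloc[0]
print(first_row)
```

With the default index (0, 1, 2, ...), labels and positions coincide, which is why the distinction is easy to miss.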
Method 4: Selecting by Position
#22–23 Finding Unique Values in a DataFrame
To print all the distinct values in a column, use the unique() method.
If you want to print the number of unique values, use nunique() instead.
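A short sketch of both methods on a hypothetical column (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical data with a repeated value in "col2".
df = pd.DataFrame({"col1": [1, 2, 3, 1], "col2": ["A", "B", "C", "A"]})

# Distinct values of the column, in order of first appearance.
distinct_values = df["col2"].unique()

# Number of distinct values in the column.
n_distinct = df["col2"].nunique()

print(distinct_values)
print(n_distinct)
```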
#24 Applying a Function to a DataFrame
If you want to apply a function to a DataFrame, row by row or column by column, use the apply() method as demonstrated below:
You can also apply a function to a single column as follows:
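One possible illustration of `apply()` on a whole DataFrame is sketched below. The helper `add_cols` and the column names are hypothetical. With `axis=1`, the function receives one row at a time:

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

def add_cols(row):
    """Hypothetical helper: add the two columns of a single row."""
    return row["col1"] + row["col2"]

# axis=1 passes each row (as a Series) to the function.
df["col3"] = df.apply(add_cols, axis=1)

print(df)
```

For simple arithmetic like this, the vectorized form `df["col1"] + df["col2"]` is usually faster; `apply()` is most useful when the logic cannot be expressed column-wise.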
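A minimal sketch of the single-column case, assuming a hypothetical helper `square` and made-up data:

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

def square(value):
    """Hypothetical helper: square a single value."""
    return value ** 2

# Series.apply calls the function once per element of the column.
df["col1_squared"] = df["col1"].apply(square)

print(df)
```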
#25–26 Handling Duplicates
You can mark all the repeated rows using the df.duplicated() method:
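A small sketch on hypothetical data, where the third row repeats the first. By default only the later occurrences are marked; passing `keep=False` marks every row that has a duplicate:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({"col1": [1, 2, 1], "col2": ["A", "B", "A"]})

# Boolean Series: True for rows that repeat an earlier row.
later_repeats = df.duplicated()

# keep=False marks all members of a duplicated group, including the first.
all_repeats = df.duplicated(keep=False)

print(later_repeats)
print(all_repeats)
```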
Further, you can drop the duplicated rows using the df.drop_duplicates() method as follows:
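As a sketch, using hypothetical data in which the third row repeats the first:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({"col1": [1, 2, 1], "col2": ["A", "B", "A"]})

# Returns a new DataFrame that keeps the first occurrence of each row.
deduped = df.drop_duplicates()

print(deduped)
```

The original `df` is left unchanged, and the surviving rows keep their original index labels.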
#27 Finding the Distribution of Values
To find the frequency of each unique value in a column, use the value_counts() method:
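A brief sketch on a hypothetical column:

```python
import pandas as pd

# Hypothetical data: "A" appears three times, "B" twice, "C" once.
df = pd.DataFrame({"col2": ["A", "B", "A", "C", "A", "B"]})

# Series mapping each distinct value to its count, most frequent first.
counts = df["col2"].value_counts()

print(counts)
```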
#28 Resetting the Index of a DataFrame
To reset the index of the DataFrame, use the df.reset_index() method:
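A minimal sketch, assuming a hypothetical DataFrame whose index is 10, 20, 30 (for example, what is left over after filtering). By default, the old index is kept as a new column named `index`:

```python
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": ["A", "B", "C"]},
    index=[10, 20, 30],
)

# New DataFrame with a fresh 0..n-1 index; the old index becomes a column.
reset = df.reset_index()

print(reset)
```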
To drop the old index, pass drop=True as an argument to the above method:
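The same idea with `drop=True`, sketched on a hypothetical DataFrame indexed by 10, 20, 30. The old index is discarded instead of being added as a column:

```python
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": ["A", "B", "C"]},
    index=[10, 20, 30],
)

# New DataFrame with a fresh 0..n-1 index; the old index is thrown away.
reset = df.reset_index(drop=True)

print(reset)
```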
#29 Finding Cross-tabulation
To return the frequency of each combination of values across two columns, use the pd.crosstab() function:
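A short sketch with two hypothetical categorical columns:

```python
import pandas as pd

# Hypothetical data: each row pairs a value of "col1" with a value of "col2".
df = pd.DataFrame(
    {
        "col1": ["A", "B", "A", "B", "A"],
        "col2": ["X", "X", "Y", "Y", "X"],
    }
)

# Rows are the values of col1, columns are the values of col2,
# and each cell counts how often that pair occurs.
table = pd.crosstab(df["col1"], df["col2"])

print(table)
```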
#30 Pivoting DataFrames
Pivot tables are a commonly used data analysis tool in Excel. Similar to crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate your data.
Consider the DataFrame below:
With the pd.pivot_table() function, you can turn the entries of one column into column headers:
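A minimal sketch, assuming a hypothetical long-format DataFrame with the columns `Name`, `Subject` and `Marks` (names and values are made up for illustration). The distinct entries of `Subject` become the column headers:

```python
import pandas as pd

# Hypothetical long-format data: one row per (Name, Subject) pair.
df = pd.DataFrame(
    {
        "Name": ["A", "A", "B", "B"],
        "Subject": ["Maths", "Science", "Maths", "Science"],
        "Marks": [80, 90, 70, 60],
    }
)

# One row per Name, one column per Subject, cells hold the mean of Marks.
pivot = pd.pivot_table(
    df, index="Name", columns="Subject", values="Marks", aggfunc="mean"
)

print(pivot)
```

Since each (Name, Subject) pair occurs only once here, the mean is simply the original mark; with repeated pairs, `aggfunc` decides how they are combined.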
Congratulations 🎊, you have just learned about the 30 most useful methods in Pandas.
To conclude, I can confidently say that you will likely use these methods 95% of the time working with Pandas.
This claim is based on my own experience, as well as on working with fellow data scientists and seeing their work.