40 Examples to Master Pandas
A comprehensive practical guide

Pandas is one of the most widely-used data analysis and manipulation libraries. It provides numerous functions and methods to clean, process, manipulate, and analyze data.
The best way to get comfortable working with Pandas is through practice. I previously wrote a practical guide that contains 30 examples.
In this article, I will enrich the examples to cover a broader scope together with the previous article. 40 examples in this article will include not only the basic functions and techniques but also some extreme cases.
Most of the examples include the functions and methods that were not discussed in the previous article. The few examples that cover the same functions are the ones that I want to emphasize and explain again with a different example.
We will be using a marketing and a grocery data set to do the examples. The first example is reading the csv files into Pandas dataframes.
1. Reading csv files
The read_csv function provides flexible ways for reading csv files into Pandas dataframes.
import numpy as np
import pandas as pdmarketing = pd.read_csv("/content/DirectMarketing.csv")
groceries = pd.read_csv("/content/Groceries_dataset.csv")

2. Changing data type with astype
The dates need to be stored in the datetime data type in order to use the datetime functions of Pandas. Let’s check the data type of the columns of the groceries dataframe.
groceries.dtypesMember_number int64
Date object
itemDescription objectAs you can see, the data type of the date column is object. We can change it using the astype function.
groceries['Date'] = groceries['Date'].astype("datetime64")groceries.dtypesMember_number int64
Date datetime64[ns]
itemDescription object3. Changing the data type with to_datetime
We can also use the to_datetime function to assign appropriate data types for dates. The syntax is a little different than the astype function.
groceries['Date'] = pd.to_datetime(groceries['Date'])4. Parsing dates
In the first examples, I mentioned that the read_csv function is quite flexible at reading the csv files. It can also handle the dates. We can assign appropriate data types for dates while reading the data. It will save us from having to change the data type later on.
groceries = pd.read_csv("/content/Groceries_dataset.csv", parse_dates=['Date'])groceries.dtypesMember_number int64
Date datetime64[ns]
itemDescription object5. Filtering with the isin method
There are many ways to filter a dataframe based on the values. We can use logical operators such as equal (==), not equal (!=), or greater than (>).
The isin method allows to filter based on a specific set of values. We can just pass a list of the values we want to filter.
groceries[groceries.Member_number.isin([3737, 2433, 3915, 2625])].shape(126, 3)There are 126 entries that belong to the customers whose member number is given in the list.
6. Tilde operator
The tilde (~) operator can be used as “not” while applying filters. For instance, we can find the complement of the filtered rows in the previous example by just adding the tilde operator at the beginning.
groceries[~groceries.Member_number.isin([3737, 2433, 3915, 2625])].shape(38639, 3)7. Value counts with normalization
The value_counts is one of most frequently used functions. It counts the number of occurrences of each value and returns a series. If it is used with the normalize parameter, we get an overview of the percentage of the occurrences.
marketing.Catalogs.value_counts(normalize=True)12 0.282
6 0.252
24 0.233
18 0.233The most frequent value in the catalogs column is 12 which occupies about 28 percent of the entire column.
8. Setting a column as index
Pandas assigns integer index to dataframes by default but we can change it to any column using the set_index function.
For instance, we can set the date column as the index of the groceries dataframe.
groceries.set_index('Date', inplace=True)
9. Resetting the index
When some rows are dropped, Pandas does not automatically reset the index. Similarly, when two dataframes are concatenated, the indices will not be reset. In such cases, the new dataframes will not have consecutive index values.
We can use the reset_index function in those cases. I have dropped some rows of the groceries dataframe:

As you can see, some indices are skipped. We can now use the reset_index function.
groceries.reset_index(drop=True, inplace=True)
The order of values are the same but the index is reset. The drop parameter is important. If we do not set it as True, the old index will be kept as a new column in the dataframe. The inplace parameter ensures the changes are saved.
10. The unique values
The unique function returns an array of the unique values in a column.
groceries['itemDescription'].unique()[:5]array(['tropical fruit', 'whole milk', 'pip fruit', 'other vegetables','rolls/buns'], dtype=object)I have only displayed the first 5 elements for demonstration purposes.
11. The number of unique values
If we are only interested in the number of unique values, we can use the nunique function. It can be called on the entire dataframe or a particular column.
groceries.nunique()Member_number 3898
Date 728
itemDescription 167There are other ways to count the number of unique values in a column. For instance, the length of the array returned by the unique function gives us the number of unique values.
11. Creating a random sample of larger size
The sample function can be used to create a random sample of the rows of a dataframe. It comes in handy when working with unbalanced datasets in machine learning.
We can only create samples that are smaller than the original one unless the replace parameter is changed to true. The replace parameter allows for using the same row more than once.
Let us create a random sample of the groceries dataframe by only using the rows with a spent amount of less than 300.
less = marketing[marketing.AmountSpent < 300].sample(n=400, replace=True)less.shape
(400, 10)12. Combining dataframes
We can concatenate dataframes horizontally or vertically with the concat function. The axis parameter is used to determine the axis through which the concatenation occurs.
We can concatenate the marketing and sample dataframes we created in the previous examples.
less.shape, marketing.shape
((400, 10), (1000, 10))new = pd.concat([marketing, less])new.shape
(1400, 10)The default values of the axis parameter is 0 which means concatenating along index. The dataframes must have the same number of columns.
13. Selecting a range of rows and columns by index
We can select a range of rows and columns by using the iloc function. It accepts the indices of the desired rows and columns. The pandas dataframes have integer index for both rows and columns.
For instance, we can select the first 4 rows and the first 3 columns as follows:

14. Selecting specific rows and columns by index
The iloc function also accepts an array of values instead of ranges. We can pass a list or numpy array.
toselect = np.random.randint(100, size=7)marketing.iloc[toselect, [2,4,6]]
We have created a numpy array of 7 random integers between 0 and 100. We have passed this array to the iloc function along with a list of 3 columns to be selected.
15. Selecting rows and columns by label
The loc function is just like the iloc function but it accepts labels instead of indices.

I have replicated the selection in the previous example to point out the difference between the loc and iloc functions. You may have noticed that the same array is used for the row part. The reason is that the labels and indices of the rows are the same unless we assign different labels for rows.
16. Extracting the year and month from dates
Pandas provides lots of functions to operate on the dates. They are used through the dt accessor.
We can easily extract the year and month from dates as follows:
groceries['Year'] = groceries['Date'].dt.year
groceries['Month'] = groceries['Date'].dt.month
17. Dropping columns and rows
In the previous example, we created two new columns. By default, pandas add the new columns at the end of a dataframe but we can change it.
We will add the new columns at a specific position in the next example. However, we first need to drop them which can be done by using the drop function.
groceries.drop(['Year','Month'], axis=1, inplace=True)We pass the list of columns or rows to be dropped. The axis parameter needs to be 1 to drop columns and 0 to drop rows.
18. Inserting a column
The year and month columns might look better if they are placed before the date column. We can use the insert function to accomplish this task.
year = groceries['Date'].dt.year
month = groceries['Date'].dt.monthgroceries.insert(1, 'Month', month)
groceries.insert(2, 'Year', year)
19. Replacing values
In the previous example, we created a month column that contains numbers to represent months. You may want this column to contain the names of months (i.e. January, February, and so on).
There are multiple ways to do this operation. I will first show you the harder way. In the next example, we will see a much simpler method.
We can use the replace function to replace integers with strings of month names.
month_names = {1:'January', 2:'February', 3:'March', 4:'April',
5: 'May', 6:'June', 7:'July', 8:'August', 9:'September',
10:'October', 11:'November', 12:'December'}groceries.Month.replace(month_names, inplace=True)We have created a dictionary that indicates the replacements and then passed it to the replace function.
20. Month name
There is much simpler way of doing the task in the previous step. We will make a function of the dt accessor.
groceries['Month'] = groceries['Date'].dt.month_name()
We can directly retrieve the month name from the date column. It is important to note that the functions of dt accessor can only be used with datetime like values.
21. The cumulative sum
The cumsum function allows to create a column based on the cumulative sum of another column. Consider the marketing dataframe. We can create a column that contains the cumulative sum of the spent amount.
marketing['CumAmountSpent'] = marketing['AmountSpent'].cumsum()
22. Filtering strings
In the previous example, we have mentioned the dt accessor which makes it very easy and simple to deal with dates. Similarly, the str accessor provides many functions and methods that expedite to process textual data.
For instance, we can check if strings contain a specific set of characters. A typical use case would be to count the number of rows that contains the word “milk” in the description column of the groceries table.
groceries.itemDescription.str.contains('milk').sum()
3186groceries.itemDescription.str.contains('whole milk').sum()
2502The contains function returns true if a value contains the given string. By applying the sum function (1 for each true value), we calculate the total number of rows that contains the word “milk”.
23. Filtering strings based on length
We can also filter string based on the length (i.e. number of characters). Let us find the items with long descriptions.
groceries[groceries.itemDescription.str.len() > 20]\
.itemDescription.unique()array(['fruit/vegetable juice', 'packaged fruit/vegetables',
'frozen potato products', 'Instant food products',
'female sanitary products', 'house keeping products',
'chocolate marshmallow', 'long life bakery product',
'flower soil/fertilizer', 'preservation products'], dtype=object)The filtering is on the item description column and the descriptions that are longer than 20 characters are selected.
24. Plotting the distribution of a variable
Pandas is not a data visualization library so it is not optimized for visualization tasks. However, it provides plotting functions which I think make it highly convenient to produce basic plots.
For instance, we can create a kde plot to see the distribution of the salary column.
marketing.Salary.plot(kind='kde', title='Distribution of Salary',
figsize=(10,6))
25. Creating a histogram
Histograms are also commonly used to check the distribution of a numerical feature. We can use the plot function to produce histograms as well.
marketing.Salary.plot(kind='hist', title='Distribution of Salary',
figsize=(10,6))
26. Trend in the monthly sales
In this example, we will combine a few operations to create a plot that shows the trend in monthly sales. The first step is to create a month column as we did previously.
groceries['month_name'] = groceries['Date'].dt.month_name()We will calculate the number of items sold in each month by using the group by function and then plot the values.
groceries[['month_name','Date']].groupby('month_name')\
.count().plot(title="Monthly Sales", figsize=(10,6))
You may have noticed that we did not use the kind parameter of the plot function. The reason is that the default value of kind parameter produces a line plot which is what we need in our case.
27. Different aggregate funtions to different columns
It is possible to apply different aggregate functions to different columns in the group by function. We can pass a dictionary to indicate which functions will be applied to which columns.
marketing[['Married','Salary','AmountSpent']].groupby(['Married'])\
.agg({'Salary':'mean', 'AmountSpent':'sum'})
We have calculated the average salary and total spent amount by each group in the married column. However, it would be better if we also somehow indicate which functions are applied to each column.
The solution is the NamedAgg method.
28. NamedAgg in group by
We will do the same operation as in the previous example but only change the column names in the result.
marketing[['Married','Salary','AmountSpent']].groupby(['Married'])\
.agg(
Average_salary = pd.NamedAgg('Salary', 'mean'),
Total_spent = pd.NamedAgg('AmountSpent', 'sum')
)
29. Crosstab function
The cross tab function is used to create a cross table based on specified columns, values, and aggregate functions. It is similar to a pivot table.
For instance, we can calculate the average salary of cross categories between the age and gender columns.
pd.crosstab(index=marketing.Age, columns=marketing.Gender, values=marketing.Salary, aggfunc='mean').round(1)
The middle aged males have the highest average salary.
30. Crosstab function — 2
We will do a slightly more complex example with the crosstab function. We can pass multiple columns and also display the overall values.
pd.crosstab(index=[marketing.Age, marketing.Married], columns=marketing.Gender,values=marketing.Salary, aggfunc='mean',
margins=True).round(1)
This cross table is more informative than the previous one as it includes more specific categories and overall average values.
31. Pivot_table function
It is extremely similar to the crosstable function with a few small differences in the syntax. I will create the same table as in the previous example using the pivot_table function.
pd.pivot_table(data=marketing, index=['Age', 'Married'], columns='Gender', values='Salary', aggfunc='mean',
margins=True).round(1)
We can pass the dataframe to the data parameter and use the column names as strings.
32. Splitting strings
The string accessor can be used to split or combine strings. For instance, we can split the parts of the date in the groceries dataframe to obtain day, month, and year values.
Please note that the data type should be object or string to be able apply str accessor.
Note: If you have the date column stored with datetime64[ns] data type, convert it back to “string” data type in order to apply the following splitting operation. Converting back to “object” will not allow to use str accessor for some reason. However, if you have it stored as “object” or “string” at the first place, you can apply the str accessor.
groceries['month'] = groceries['Date']\
.str.split('-', expand=True)[1]We have splitted the date colum at “-” characted. The expand parameter is set as true to create a different column for each part. We have selected the second column ([1]) which is the month.

33. Splitting strings on character level
We can select part of strings based on the position of characters. Consider the previous example. We may want to retrieve the last two characters of the years (e.g. 15 instead of 2015).
The str accessor allows indexing on strings.
groceries['year'] = groceries['Date']\
.str.split('-', expand=True)[2].str[-2:]
34. Sidetable
Sidetable is an add-on for Pandas which makes it easier to create summaries of dataframes. It can be considered as a combination of value counts and cross tab functions.
Once installed, it can be used as other accessors such as str and dt.
pip install sidetable
import sidetablegroceries.stb.freq(['itemDescription'], thresh=25)
Freq function returns a dataframe that conveys 3 pieces of information.
- The number of observations (i.e. rows) for each category (value_counts()).
- The percentage of each category in the entire column (value_counts(normalize=True)).
- The cumulative versions of the two above.
Sidetable offers more functionality. I wrote a detailed article about sidetable if you’d like to read further.
35. Finding missing values
Missing values need to be handled very carefully in order to make accurate and robust analysis.
The isna function can be used the find the missing values in a dataframe. It returns true if the value is missing. Thus, we can count the total number of missing values by applying the sum function.
groceries.isna().sum()Member_number 0
Date 0
itemDescription 0We do not have any missing values in the groceries dataframe.
36. Handling missing values
The fillna function can be used to handle missing values. It provides many options to fill missing values such as mean, median, or a constant value.
We can also use the previous or next value to fill a missing value.
Let us first change a few values as missing value in the groceries dataframe.
groceries.iloc[[1,10,30], [1,2]] = np.nangroceries.isna().sum()
Member_number 0
Date 3
itemDescription 3We can use the most frequent item to fill missing values in the item description column. For the date column, we will use the previous value to replace a missing value.
groceries['itemDescription']\
.fillna(value=groceries['itemDescription'].mode()[0], inplace=True)groceries['Date'].fillna(method='ffill', inplace=True)groceries.isna().sum()
Member_number 0
Date 0
itemDescription 037. Selecting data types
The select_dtypes function can be used select columns that belong or does not belong to a particular data type.
marketing.select_dtypes(include='object').columns
Index(['Age', 'Gender', 'OwnHome', 'Married', 'Location', 'History'], dtype='object')marketing.select_dtypes(exclude='object').columns
Index(['Salary', 'Children', 'Catalogs', 'AmountSpent'], dtype='object')We can include or exclude certain data types.
38. Creating a dataframe
The DataFrame function can be used to create a dataframe. A dictionary can be passed to the DataFrame function. The keys will be the column names and the values will represent the row values.
Let’s create a dataframe that contains the prices of the items in the groceries dataframe.
unique_items = groceries.itemDescription.unique()prices = pd.DataFrame({
'itemDescription': unique_items,
'prices':np.random.randint(10, size=len(unique_items))
})
We assign the prices randomly by creating a numpy array of random integers between 0 and 10.
39. Merging dataframes
The merge function can be used to merge two dataframes based on a shared column or columns. For instance, we can merge the groceries and price dataframes based on the item description column.
merged_df = groceries.merge(prices, on='itemDescription')
40. Correlations
When working on a machine learning task, the correlations between numerical variables need to be taken into consideration.
The corr function calculates the correlations and returns a matrix that contains correlation coefficients between variables.

As we can see, the salary and spent amount is highly correlated.
Conclusion
In this article and the previous one, we have covered a great deal of the functions and methods of Pandas.
As you keep using pandas for your data analysis tasks, you may discover new functions and methods. As with any other subject, practice makes perfect.
Thank you for reading. Please let me know if you have any feedback.
