
Three Very Useful Functions of Pandas to Summarize the Data

Pandas count, value_counts, and crosstab functions in detail

Pandas is a very popular Python library for data analysis, and it has a large number of functions. This article discusses three very useful and widely used functions for summarizing data. I explain them with examples so we can use them to their full potential.

The three functions I am talking about today are count, value_counts, and crosstab.

The count function is the simplest. The value_counts function can do a bit more, and the crosstab function does even more complicated work with simple commands.

The famous Titanic dataset is used for this demonstration. Please feel free to download the dataset and follow along from this link:

rashida048/Datasets


First import the necessary packages and the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("titanic_data.csv")
df.head()

How many rows and columns are in the dataset?

df.shape

Output:

(891, 12)

The dataset has 891 rows of data and 12 columns.

Count

This is a simple function, but it is very useful for an initial check. We just learned that there are 891 rows in the dataset. In an ideal case, all 12 columns would have 891 values. But that doesn’t happen all the time. Most of the time we have to deal with null values. If you look at the first five rows of the dataset, there are some NaN values.

How many non-null values are there in each column?

df.count(0)

Output:

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

Most columns have 891 values, but not all of them: ‘Age’, ‘Cabin’, and ‘Embarked’ have missing values.
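As a minimal sketch of what count reports, the snippet below uses a small made-up DataFrame (the values are hypothetical, not taken from the Titanic file) and compares the non-null count with the number of missing values per column. The two should always add up to the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

non_null = toy.count()       # non-null values in each column
missing = toy.isna().sum()   # null values in each column

print(non_null)
print(missing)
```

Looking at both numbers side by side is a quick way to decide which columns need cleaning before any further analysis.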

Let’s check the same for rows. How many non-null values are there in each row?

df.count(1)

Output:

0      11
1      12
2      11
3      12
4      11
       ..
886    11
887    12
888    10
889    12
890    11
Length: 891, dtype: int64

If the dataset has a multi-level index, we can check the data count by index level. To demonstrate that, I need to set the index first. I will set two columns as the index.

df = df.set_index(['Sex', 'Pclass'])
df

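Now the dataset has two index levels: ‘Sex’ and ‘Pclass’. Counting by the ‘Sex’ level shows the data count for each gender. Older pandas versions accepted df.count(level = "Sex"), but the level parameter was deprecated in pandas 1.3 and removed in 2.0, so grouping by the level is the portable form:

df.groupby(level = "Sex").count()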

In the same way, counting by the ‘Pclass’ level shows the data count of every feature for each passenger class:

df.groupby(level = 'Pclass').count()

I just want to reset the index now to bring the DataFrame back to its original shape:

df = df.reset_index()

‘Sex’ and ‘Pclass’ are regular columns again, and the DataFrame is back to the default integer index.
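Here is a small self-contained sketch of counting by index level. The data is a hypothetical four-row frame, and it assumes a recent pandas version, where level-wise counts are written with groupby(level=...).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
}).set_index(["Sex", "Pclass"])

# Non-null count of every remaining column, per index level
by_sex = toy.groupby(level="Sex").count()
by_class = toy.groupby(level="Pclass").count()

print(by_sex)
print(by_class)
```

The missing ‘Age’ shows up as a lower count in the ‘female’ row and in the class 3 row, while ‘Fare’ is complete in every group.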

Before, we saw how to get the data count for all the columns together. The next example shows how to get the data count for an individual column. Here is how to get the data count for the ‘Fare’ column for each ‘Pclass’.

df.groupby(by = 'Pclass')['Fare'].agg('count')

Output:

Pclass
1    216
2    184
3    491
Name: Fare, dtype: int64

You can also add a column with the data count of a single feature.

Here I am adding a new column named ‘freq’ that contains, for each row, the count of ‘Fare’ values in that row’s ‘Pclass’.

df['freq'] = df.groupby(by='Pclass')['Fare'].transform('count')
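The difference between agg and transform is easiest to see on a tiny example. The sketch below uses hypothetical data: agg returns one value per group, while transform returns one value per original row, which is why its result can be assigned back as a column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 3],
    "Fare": [71.28, 7.25, 7.92, 53.10, 8.05],
})

# One row per group: the result is indexed by Pclass
per_group = toy.groupby("Pclass")["Fare"].agg("count")

# One value per original row: the result is aligned with toy's index
toy["freq"] = toy.groupby("Pclass")["Fare"].transform("count")

print(per_group)
print(toy)
```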

That was all for the count.

Value_counts

The value_counts function is more concise in some respects. It is possible to achieve a bit more with a smaller amount of code. Before, we used the groupby function and count to find the data count of each ‘Pclass’. That can be done even more easily using value_counts:

df['Pclass'].value_counts(sort = True, ascending  = True)

Output:

2    184
1    216
3    491
Name: Pclass, dtype: int64

Here we not only got the value counts but also got them sorted in ascending order. By default, value_counts already sorts the counts in descending order; if you do not want them sorted by count at all, pass sort = False.

The values can be normalized as well using the normalize parameter:

df['Pclass'].value_counts(normalize=True)

Output:

3    0.551066
1    0.242424
2    0.206510
Name: Pclass, dtype: float64
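To make the sort and normalize options concrete, here is a short sketch on a hypothetical six-element Series. It is only an illustration of the parameters discussed above, not output from the Titanic data.

```python
import pandas as pd

# Hypothetical passenger classes
pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

default_order = pclass.value_counts()                   # most frequent first
ascending_order = pclass.value_counts(ascending=True)   # least frequent first
unsorted = pclass.value_counts(sort=False)              # not ordered by count
shares = pclass.value_counts(normalize=True)            # proportions, sum to 1

print(default_order)
print(ascending_order)
print(shares)
```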

One last thing I want to show on value_counts is binning. Here I divided ‘Fare’ into three equal-width bins:

df['Fare'].value_counts(bins = 3)

Output:

(-0.513, 170.776]     871
(170.776, 341.553]     17
(341.553, 512.329]      3
Name: Fare, dtype: int64
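The bins are equal in width, not in size, which is why most fares land in the first bin. The sketch below, using hypothetical fare values, shows the same behaviour and compares it with binning explicitly through pd.cut, which produces the same counts for this data.

```python
import pandas as pd

# Hypothetical fares with one large outlier, similar in spirit to the Titanic data
fare = pd.Series([0.0, 7.25, 8.05, 26.0, 71.28, 263.0, 512.33], name="Fare")

# Three equal-width bins between the minimum and maximum fare
binned = fare.value_counts(bins=3)

# The same idea written in two steps: bin first, then count
cut_counts = pd.cut(fare, bins=3).value_counts()

print(binned)
print(cut_counts)
```

When the distribution is this skewed, quantile-based bins (pd.qcut) may give a more informative split.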

Crosstab

The crosstab function can do even more work for us in a single line of code. Here is the simplest way of using crosstab:

pd.crosstab(df['Sex'], df['Pclass'])

If we need totals, margins = True adds the totals for both rows and columns:

pd.crosstab(df['Sex'], df['Pclass'], margins = True, margins_name = "Total")

We can get normalized values the way we did with the value_counts function, for example by passing normalize = True to crosstab.

If you add all the values in the resulting table, the total is one. So, the normalization is done based on the total of all the values. But what if we need to normalize based on gender or ‘Pclass’ only? That is also possible.

pd.crosstab(df['Sex'], df['Pclass'], normalize='columns')

Each column adds up to 1. So, the table above shows the proportion of males and females in each ‘Pclass’.

We can normalize by the index of the table to find the proportion of each ‘Pclass’ within each gender.

pd.crosstab(df['Sex'], df['Pclass'], normalize='index')

In the table above, each row adds up to 1.
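The three normalization modes can be checked on a small example. This sketch uses hypothetical data and simply verifies which direction sums to 1 in each case.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
})

overall = pd.crosstab(toy["Sex"], toy["Pclass"], normalize="all")        # whole table sums to 1
by_column = pd.crosstab(toy["Sex"], toy["Pclass"], normalize="columns")  # each column sums to 1
by_row = pd.crosstab(toy["Sex"], toy["Pclass"], normalize="index")       # each row sums to 1

print(overall.values.sum())
print(by_column.sum(axis=0))
print(by_row.sum(axis=1))
```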

The next example finds the mean ‘Fare’ for each ‘Pclass’ and each gender. The values are rounded to 2 decimal places.

pd.crosstab(df['Sex'], df['Pclass'], values = df['Fare'], aggfunc = "mean").round(2)
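With values and aggfunc, crosstab behaves like a grouped aggregation reshaped into a table. As a rough illustration on hypothetical data, the sketch below builds the same table with groupby and unstack and compares the two.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 80.00],
})

# Mean fare per gender (rows) and passenger class (columns)
mean_fare = pd.crosstab(
    toy["Sex"], toy["Pclass"], values=toy["Fare"], aggfunc="mean"
).round(2)

# The same table built step by step: group, aggregate, then pivot Pclass into columns
same_table = toy.groupby(["Sex", "Pclass"])["Fare"].mean().unstack().round(2)

print(mean_fare)
print(same_table)
```

Which form to use is mostly a matter of taste; crosstab keeps the row and column roles explicit in a single call.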

So far we used one layer in the row direction and one layer in the column direction. Here, I am using two layers of data in the column direction:

pd.crosstab(df['Pclass'], [df['Sex'], df['Survived']])

This table shows how many people did and did not survive in each passenger class for each gender. Using the normalize parameter, we can find the proportions as well.

pd.crosstab(df['Pclass'], [df['Sex'], df['Survived']], normalize = 'columns')
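A table with two column layers has a MultiIndex on its columns, so individual cells are addressed with a tuple. The following sketch, on hypothetical data, shows how to read single cells and how to pull out everything under one outer label.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 0, 0, 1, 1, 0],
})

table = pd.crosstab(toy["Pclass"], [toy["Sex"], toy["Survived"]])

# A single cell: first-class women who survived
first_class_women_survived = table.loc[1, ("female", 1)]

# All columns under the outer label 'female' (one column per Survived value)
women_only = table["female"]

print(table)
print(first_class_women_survived)
print(women_only)
```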

Let’s use two layers in the columns and two layers in the rows:

pd.crosstab([df['Pclass'], df['Sex']], [df['Embarked'], df['Survived']],
           rownames = ['Pclass', 'gender'],
           colnames = ['Embarked', 'Survived'],
           dropna=False)

So much information is packed into this one table. It looks even better in a heatmap.

import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))

sns.heatmap(pd.crosstab([df['Pclass'], df['Sex']], [df['Embarked'], df['Survived']]), cmap = "YlGnBu",annot = True)
plt.show()

In the x-direction, it shows the ‘Embarked’ and ‘Survived’ data. In the y-direction, it shows the Passenger class and gender.


Conclusion

This article shows, in detail, some very popular functions for summarizing data. There are many ways to summarize data; these are some simple and useful ones.

