Suraj Gurav

Data Science

Two Interesting Pandas Data Manipulation Functions You Need to Know

Extremely useful pandas functions for converting a continuous pandas column into a categorical one.

Photo by Brendan Church on Unsplash

Python pandas is a powerful and widely used library for data analysis.

It comes with 200+ functions and methods, making data manipulation and transformation easy. However, knowing all of these functions and using them where needed in day-to-day work isn’t feasible.

One of the common tasks in data manipulation is converting a column having continuous numerical values into a column containing discrete or categorical values. And pandas has two amazing built-in functions which can certainly save you a few minutes.

You can use this type of data transformation for a variety of applications, such as grouping data, analyzing data by discrete groups, or visualizing data using histograms.

For example, I recently calculated the Herfindahl-Hirschman Index (HHI) to understand the market concentration of multiple brands. So in a pandas DataFrame, I had a column with continuous HHI values for all brands. Ultimately, I wanted to convert this column to a discrete one to categorize each brand as having low, medium, or high market concentration. That’s where I got the inspiration for this story.

Without knowing these built-in pandas functions, you might need to write multiple if-else and for statements to get the same work done.
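To make the contrast concrete, here is a minimal sketch that buckets a few values first with a manual loop and then with a single call to cut(), the function introduced in the next section. The values and thresholds are hypothetical and chosen only for illustration; they are not from my project.

```python
import pandas as pd

# Hypothetical values and thresholds, used only for illustration
scores = pd.Series([0.05, 0.20, 0.45, 0.80])

# Manual approach: a for loop with if-else branches
manual = []
for value in scores:
    if value <= 0.15:
        manual.append("low")
    elif value <= 0.25:
        manual.append("medium")
    else:
        manual.append("high")

# Built-in approach: one call to pd.cut() with explicit bin edges
binned = pd.cut(scores,
                bins=[0, 0.15, 0.25, 1.0],
                labels=["low", "medium", "high"])

print(manual)
print(binned.tolist())
```

Both approaches produce the same categories here, but the built-in version stays a single call no matter how many buckets you add.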

Therefore, here you’ll explore these two super-useful built-in pandas functions along with interesting examples (including my project), which will supercharge your data analysis and save you a couple of minutes.

Often you need to convert a column with continuous values into another column with discrete values in your analytics project.

So basically you categorize the continuous data into several categories, i.e. buckets or bins. And you can do so by either specifying minimum and maximum values for each bin, i.e. defining bin edges or by specifying the number of bins.

Depending on your purpose of splitting a continuous series into a discrete one, you can use one of the next two methods.

As I was curious about a built-in function for my work, I first came across the cut() function from the pandas library.

pandas cut()

You can use pandas cut() when you want to split the data into a fixed number of different buckets, irrespective of the number of values in each bucket.

As per pandas official documentation, there are 7 optional parameters for the function pandas.cut() along with 2 mandatory parameters.

But you don’t need to remember all of them.

I’ve simplified things for you. I’m using this function quite often nowadays and found some of the function parameters more useful than others.

Here are the commonly used optional parameters which you’ll use in almost 90% of the cases.

pandas.cut(x,
           bins,
           right=True,
           labels=None,
           include_lowest=False)

Let’s take an example to understand how each of these parameters works.

Suppose you have the following continuous series, which you would like to convert into 5 bins.

import pandas as pd
import numpy as np

# Create random data
Series1 = pd.Series(np.random.randint(0, 100, 10))

# Create DataFrame
df = pd.DataFrame({"Series1": Series1})

# Apply pandas.cut() on the column Series1
df["binned_Series1"] = pd.cut(df["Series1"], bins=5)
pandas cut() | Image by Author

You simply passed the integer 5 to the bins parameter. As a result, pandas split the range of Series1 into 5 equal-width buckets and assigned each value from Series1 to one of them.

If you inspect each of these buckets, you’ll see two things are common.

  1. The bin edges are non-integer. You can fix this by defining the bin edges yourself in the bins parameter.
  2. Each bin is closed on the right. This comes from the default setting right=True, which means pandas includes the upper edge of a bucket in that bucket. Switching it to right=False closes the bins on the left instead, so this parameter controls whether a value that falls exactly on an edge lands in the lower or the upper bin.
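The effect of right, and of the related include_lowest parameter, is easiest to see on values that sit exactly on the bin edges. The snippet below is a small sketch with made-up values and edges; it is not part of the running example.

```python
import pandas as pd

# Illustrative values that fall exactly on the bin edges
values = pd.Series([0, 10, 15, 40])
edges = [0, 10, 15, 40]

# Default: bins are (0, 10], (10, 15], (15, 40], so 0 gets no bin
right_closed = pd.cut(values, bins=edges)

# right=False: bins are [0, 10), [10, 15), [15, 40), so 40 gets no bin
left_closed = pd.cut(values, bins=edges, right=False)

# include_lowest=True: the first bin is widened to include the lowest edge
with_lowest = pd.cut(values, bins=edges, include_lowest=True)

print(pd.DataFrame({"value": values,
                    "right=True": right_closed,
                    "right=False": left_closed,
                    "include_lowest=True": with_lowest}))
```

Values that fall outside every bin come back as NaN, which is a quick way to spot an edge that was excluded by mistake.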

Let’s give it a second chance.

This time you’ll pass a list of bin edges for the same DataFrame column and see how the result changes.

df["binned_Series1_defined_binedge"] = pd.cut(df["Series1"],
                                              bins=[0, 10, 15, 40, 65, 100])
pandas cut with defined bin edges | Image by Author

Pandas simply created new bins using the integers you provided in the bins parameter and assigned each number of Series1 to these bins.

Moreover, you can also use the labels parameter to give a name to each of these buckets, like below.

df["bin_name"] = pd.cut(df["Series1"],
                        bins=[0, 10, 15, 40, 65, 100],
                        labels=['bin 1', 'bin 2', 'bin 3', 'bin 4', 'bin 5'])
pandas cut() with bin labels | Image by Author

It works perfectly as expected!
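Two details of the labels parameter are worth a quick sketch: the number of labels has to be one less than the number of bin edges, and passing labels=False returns the integer position of each bin instead of a name. The input values below are made up for illustration.

```python
import pandas as pd

values = pd.Series([3, 12, 33, 70])       # illustrative values
edges = [0, 10, 15, 40, 65, 100]          # 6 edges define 5 bins
names = ["bin 1", "bin 2", "bin 3", "bin 4", "bin 5"]

# One label per bin: len(names) must equal len(edges) - 1
named = pd.cut(values, bins=edges, labels=names)

# labels=False returns the integer position of each bin instead of a name
codes = pd.cut(values, bins=edges, labels=False)

print(named.tolist())
print(codes.tolist())
```

The integer codes can be handy when you only need the bucket number, for example as a key for grouping.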

Coming back to my work, a real-world scenario, I tried the function pandas.cut() on the dataset below.

import random

# Create a sample DataFrame as I cannot disclose the original data
HHI = [random.random() for i in range(10)]
Brands = ["Brand_1", "Brand_2", "Brand_3", "Brand_4", "Brand_5",
          "Brand_6", "Brand_7", "Brand_8", "Brand_9", "Brand_10"]

df = pd.DataFrame({"brand": Brands, "hhi": HHI})

# Use pandas.cut()
df["binned_hhi"] = pd.cut(df["hhi"], bins=3)
df["brand_bucket"] = pd.cut(df["hhi"], 
                            bins=3, 
                            labels = ["low", "medium", "high"])
df
Using pandas.cut() on a real-world example | Image by Author

However, the elements are unevenly distributed across these buckets, i.e. each bin contains a different number of elements: 5 brands fall in the low, 3 in the medium, and only 2 in the high concentration bucket.

But for my project, I wanted the number of brands in each bucket to be the same, and that’s where I found the next pandas function useful.

pandas qcut()

pandas.qcut() is used to put roughly the same number of elements into every bin. It works on the principle of sample quantiles.

Quantiles are the values that divide a series into a number of subsets — each containing nearly the same number of elements.

So when you cut a series using the function qcut(), it simply tells you which element of the series belongs to which quantile.
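As a rough check of this idea, you can compute the quantile boundaries yourself with Series.quantile() and compare them with the edges that qcut() reports when called with retbins=True. The sketch below uses a small series of ten numbers and splits it into quartiles; the variable names are only for illustration.

```python
import pandas as pd

data = pd.Series([17, 47, 35, 6, 6, 16, 78, 14, 79, 98])

# Quartile boundaries computed directly from the data
quartiles = data.quantile([0, 0.25, 0.5, 0.75, 1.0])

# retbins=True makes qcut() also return the bin edges it used
binned, edges = pd.qcut(data, q=4, retbins=True)

print(quartiles.tolist())
print(edges.tolist())
```

Both lines should print the same five boundaries, which shows that the bins of qcut() are nothing more than the sample quantiles of the column.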

The basic syntax of the function qcut() is almost the same as the syntax of the function cut().

Let’s understand this with an example. Here you’ll use both functions, cut() and qcut(), on the same data and categorize it into 4 bins.

Series1 = pd.Series([17, 47, 35, 6, 6, 16, 78, 14, 79, 98])
df = pd.DataFrame({"Series1": Series1})

df["qcut_Series1"] = pd.qcut(df["Series1"], q=4) # Use qcut()
df["cut_Series1"] = pd.cut(df["Series1"], bins=4) # Use cut()
Quantile-based discretization Python | Image by Author

Now, when you check the data distribution in each bin —

# Check the data distribution of each bucket when cut() was used
df["cut_Series1"].value_counts()

#Output
(5.908, 29.0]    5
(75.0, 98.0]     3
(29.0, 52.0]     2
(52.0, 75.0]     0
Name: cut_Series1, dtype: int64


# Check the data distribution of each bucket when qcut() was used
df["qcut_Series1"].value_counts()

#Output
(5.999, 14.5]    3
(70.25, 98.0]    3
(14.5, 26.0]     2
(26.0, 70.25]    2
Name: qcut_Series1, dtype: int64

You’ll see that when you used the function cut(), every bin has the same width (about 23, since (98 - 6) / 4 = 23), yet each bin contains a different number of elements.
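The width of 23 can be verified with a short sketch: divide the range of the data by the number of bins, then compare with the edges that cut() returns through retbins=True. Note that pandas lowers the first edge slightly (by 0.1% of the range) so that the minimum value is included, which is why the first interval starts at 5.908 rather than 6.

```python
import pandas as pd

data = pd.Series([17, 47, 35, 6, 6, 16, 78, 14, 79, 98])

# Equal-width bins: the range of the data divided by the number of bins
width = (data.max() - data.min()) / 4

# retbins=True makes cut() also return the bin edges it used
_, edges = pd.cut(data, bins=4, retbins=True)

print(width)
print(edges)
```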

Whereas, when you used the function qcut(), a similar number of elements were present in each bucket. But you can see such distribution came at the cost of varied bin sizes.

So in the case of my project, the function pandas.qcut() was the ultimate solution as you can see here —

df["binned_hhi_qcut"] = pd.qcut(df["hhi"], q=3)
df["brand_bucket_qcut"] = pd.qcut(df["hhi"],
                                  q=3,
                                  labels=["low", "medium", "high"])
df
Using pandas.qcut() to the real-world scenario | Image by Author

So, qcut() assigned 3 brands to each of the medium and high concentration buckets and 4 brands to the low concentration bucket.
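Ten values cannot be divided into three buckets of identical size, so the sizes can differ by one. The sketch below counts the bucket sizes for ten hypothetical HHI values, which are made up because the original data cannot be shared.

```python
import pandas as pd

# Ten hypothetical, distinct HHI values (not the original data)
hhi = pd.Series([0.05, 0.12, 0.18, 0.22, 0.31,
                 0.40, 0.47, 0.55, 0.68, 0.90])

buckets = pd.qcut(hhi, q=3, labels=["low", "medium", "high"])

# sort=False keeps the counts in bucket order: low, medium, high
counts = buckets.value_counts(sort=False)
print(counts)
```

With distinct values like these, the bucket sizes never differ by more than one, which is as close to an equal split as ten brands allow.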

I hope you found this article refreshing and useful. Although converting a continuous series into a discrete one is a common scenario in data analysis, the task can be really daunting if you don’t know the built-in functions.

Using these functions in your data analysis projects will certainly empower you to easily extract the required information from the data in no time.

Let me know in the comments which topics you would like me to cover next!

Well, just knowing these functions is not enough — start using them in your data analysis tasks to unlock the real pandas power today.


Thank you for reading!
