Abstract

/span> = np.random.randint(<span class="hljs-number">100</span>, size=<span class="hljs-number">7</span>)</pre></div><div id="b29e"><pre>marketing.iloc<span class="hljs-comment">[toselect, <span class="hljs-comment">[2,4,6]</span>]</span></pre></div><figure id="2791"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ETai02X0yN79uZj6Ex_GqQ.png"><figcaption>(image by author)</figcaption></figure><p id="bae0">We have created a numpy array of 7 random integers between 0 and 100. We have passed this array to the iloc function along with a list of 3 columns to be selected.</p><h2 id="916f">15. Selecting rows and columns by label</h2><p id="6bd0">The loc function is just like the iloc function but it accepts labels instead of indices.</p><figure id="e441"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*RfA-_61ifAuZT66GuVQXYw.png"><figcaption>(image by author)</figcaption></figure><p id="492c">I have replicated the selection in the previous example to point out the difference between the loc and iloc functions. You may have noticed that the same array is used for the row part. The reason is that the labels and indices of the rows are the same unless we assign different labels for rows.</p><h2 id="b09b">16. Extracting the year and month from dates</h2><p id="768d">Pandas provides lots of functions to operate on the dates. 
They are used through the dt accessor.</p><p id="f41f">We can easily extract the year and month from dates as follows:</p><div id="ddf6"><pre>groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Year'</span>]</span> = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span><span class="hljs-selector-class">.dt</span><span class="hljs-selector-class">.year</span> groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Month'</span>]</span> = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span><span class="hljs-selector-class">.dt</span>.month</pre></div><figure id="f01f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*2HTkOg7Y4bHJ_B_RG0zMtA.png"><figcaption>(image by author)</figcaption></figure><h2 id="f5ef">17. Dropping columns and rows</h2><p id="678d">In the previous example, we created two new columns. By default, pandas add the new columns at the end of a dataframe but we can change it.</p><p id="643e">We will add the new columns at a specific position in the next example. However, we first need to drop them which can be done by using the drop function.</p><div id="0632"><pre>groceries.drop([<span class="hljs-string">'Year'</span>,<span class="hljs-string">'Month'</span>], <span class="hljs-attribute">axis</span>=1, <span class="hljs-attribute">inplace</span>=<span class="hljs-literal">True</span>)</pre></div><p id="5721">We pass the list of columns or rows to be dropped. The axis parameter needs to be 1 to drop columns and 0 to drop rows.</p><h2 id="190f">18. Inserting a column</h2><p id="cdee">The year and month columns might look better if they are placed before the date column. 
We can use the insert function to accomplish this task.</p><div id="914c"><pre>year = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span><span class="hljs-selector-class">.dt</span><span class="hljs-selector-class">.year</span> month = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span><span class="hljs-selector-class">.dt</span>.month</pre></div><div id="d8fd"><pre>groceries.<span class="hljs-built_in">insert</span>(<span class="hljs-number">1</span>, <span class="hljs-string">'Month'</span>, <span class="hljs-built_in">month</span>) groceries.<span class="hljs-built_in">insert</span>(<span class="hljs-number">2</span>, <span class="hljs-string">'Year'</span>, <span class="hljs-built_in">year</span>)</pre></div><figure id="302c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*uvOBdXpYUbHX3MdhH7PZtw.png"><figcaption>(image by author)</figcaption></figure><h2 id="f6a3">19. Replacing values</h2><p id="d784">In the previous example, we created a month column that contains numbers to represent months. You may want this column to contain the names of months (i.e. January, February, and so on).</p><p id="3058">There are multiple ways to do this operation. I will first show you the harder way. 
In the next example, we will see a much simpler method.</p><p id="b575">We can use the replace function to replace integers with strings of month names.</p><div id="958e"><pre><span class="hljs-attribute">month_names</span> = {1:<span class="hljs-string">'January'</span>, 2:<span class="hljs-string">'February'</span>, 3:<span class="hljs-string">'March'</span>, 4:<span class="hljs-string">'April'</span>, 5: <span class="hljs-string">'May'</span>, 6:<span class="hljs-string">'June'</span>, 7:<span class="hljs-string">'July'</span>, 8:<span class="hljs-string">'August'</span>, 9:<span class="hljs-string">'September'</span>, 10:<span class="hljs-string">'October'</span>, 11:<span class="hljs-string">'November'</span>, 12:<span class="hljs-string">'December'</span>}</pre></div><div id="9601"><pre>groceries[<span class="hljs-string">'Month'</span>] = groceries[<span class="hljs-string">'Month'</span>].replace(month_names)</pre></div><p id="9ca1">We have created a dictionary that indicates the replacements, passed it to the replace function, and assigned the result back to the column.</p><h2 id="5d94">20. Month name</h2><p id="2f56">There is a much simpler way of doing the task in the previous step. We will use a function of the dt accessor.</p><div id="2e4e"><pre>groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Month'</span>]</span> = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span><span class="hljs-selector-class">.dt</span><span class="hljs-selector-class">.month_name</span>()</pre></div><figure id="f447"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*vWw19EytvY2CNLvruPb_eQ.png"><figcaption>(image by author)</figcaption></figure><p id="5ae3">We can directly retrieve the month name from the date column. It is important to note that the functions of the dt accessor can only be used with datetime-like values.</p><h2 id="2350">21. 
The cumulative sum</h2><p id="af97">The cumsum function allows us to create a column based on the cumulative sum of another column. Consider the marketing dataframe. We can create a column that contains the cumulative sum of the spent amount.</p><div id="5fdd"><pre>marketing<span class="hljs-selector-attr">[<span class="hljs-string">'CumAmountSpent'</span>]</span> = marketing<span class="hljs-selector-attr">[<span class="hljs-string">'AmountSpent'</span>]</span><span class="hljs-selector-class">.cumsum</span>()</pre></div><figure id="ecce"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ps3jgKX2eQNGryoABSt4dg.png"><figcaption>(image by author)</figcaption></figure><h2 id="45cb">22. Filtering strings</h2><p id="c12c">In an earlier example, we mentioned the dt accessor, which makes it simple to deal with dates. Similarly, the str accessor provides many functions and methods that make it easier to process textual data.</p><p id="791f">For instance, we can check if strings contain a specific set of characters. A typical use case would be to count the number of rows that contain the word “milk” in the description column of the groceries table.</p><div id="5328"><pre>groceries<span class="hljs-selector-class">.itemDescription</span><span class="hljs-selector-class">.str</span><span class="hljs-selector-class">.contains</span>(<span class="hljs-string">'milk'</span>)<span class="hljs-selector-class">.sum</span>()
<span class="hljs-number">3186</span></pre></div><div id="faa3"><pre>groceries<span class="hljs-selector-class">.itemDescription</span><span class="hljs-selector-class">.str</span><span class="hljs-selector-class">.contains</span>(<span class="hljs-string">'whole milk'</span>)<span class="hljs-selector-class">.sum</span>()
<span class="hljs-number">2502</span></pre></div><p id="b054">The contains function returns true if a value contains the given string. 
Summing these boolean values (each true value counts as 1) gives the total number of rows that contain the word “milk”.</p><h2 id="2335">23. Filtering strings based on length</h2><p id="524c">We can also filter strings based on their length (i.e. number of characters). Let us find the items with long descriptions.</p><div id="f132"><pre>groceries<span class="hljs-selector-attr">[groceries.itemDescription.str.len() > 20]</span>
<span class="hljs-selector-class">.itemDescription</span><span class="hljs-selector-class">.unique</span>()</pre></div><div id="97f9"><pre><span class="hljs-built_in">array</span>([<span class="hljs-symbol">'fruit</span>/vegetable juice', <span class="hljs-symbol">'packaged</span> fruit/vegetables', <span class="hljs-symbol">'frozen</span> potato products', <span class="hljs-symbol">'Instant</span> food products', <span class="hljs-symbol">'female</span> sanitary products', <span class="hljs-symbol">'house</span> keeping products', <span class="hljs-symbol">'chocolate</span> marshmallow', <span class="hljs-symbol">'long</span> life bakery product', <span class="hljs-symbol">'flower</span> soil/fertilizer', <span class="hljs-symbol">'preservation</span> products'], dtype=object)</pre></div><p id="0fe9">The filtering is on the item description column and the descriptions that are longer than 20 characters are selected.</p><h2 id="0cf8">24. Plotting the distribution of a variable</h2><p id="e544">Pandas is not a data visualization library so it is not optimized for visualization tasks. However, it provides plotting functions which I think make it highly convenient to produce basic plots.</p><p id="e9c7">For instance, we can create a kde plot to see the distribution of the salary column.</p><div id="2333"><pre>marketing.Salary.plot(<span class="hljs-attribute">kind</span>=<span class="hljs-string">'kde'</span>, <span class="hljs-attribute">title</span>=<span class="hljs-string">'Distribution of Salary'</span>, figsize=(10,6))</pre></div><figure id="6d06"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5F-avXA1QQwlvQfikabn4Q.png"><figcaption>(image by author)</figcaption></figure><h2 id="6152">25. Creating a histogram</h2><p id="a4c7">Histograms are also commonly used to check the distribution of a numerical feature. 
We can use the plot function to produce histograms as well.</p><div id="6a09"><pre>marketing.Salary.plot(<span class="hljs-attribute">kind</span>=<span class="hljs-string">'hist'</span>, <span class="hljs-attribute">title</span>=<span class="hljs-string">'Distribution of Salary'</span>, figsize=(10,6))</pre></div><figure id="e8fb"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*i6KxDHbhFiJ3hZHcPwo94g.png"><figcaption>(image by author)</figcaption></figure><h2 id="ec88">26. Trend in the monthly sales</h2><p id="fa9c">In this example, we will combine a few operations to create a plot that shows the trend in monthly sales. The first step is to create a month column as we did previously.</p><div id="2dba"><pre>groceries<span class="hljs-selector-attr">[<span class="hljs-string">'month_name'</span>]</span> = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span><span class="hljs-selector-class">.dt</span><span class="hljs-selector-class">.month_name</span>()</pre></div><p id="4b8f">We will calculate the number of items sold in each month by using the group by function and then plot the values.</p><div id="390c"><pre>groceries<span class="hljs-selector-attr">[[<span class="hljs-string">'month_name'</span>,<span class="hljs-string">'Date'</span>]</span>]<span class="hljs-selector-class">.groupby</span>(<span class="hljs-string">'month_name'</span>)
<span class="hljs-selector-class">.count</span>()<span class="hljs-selector-class">.plot</span>(title=<span class="hljs-string">"Monthly Sales"</span>, figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">6</span>))</pre></div><figure id="c779"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*EtI8O5uGYDb7eK9tOGmH0g.png"><figcaption>(image by author)</figcaption></figure><p id="0b80">You may have noticed that we did not use the kind parameter of the plot function. The reason is that the default value of the kind parameter produces a line plot, which is what we need in our case.</p><h2 id="5410">27. Different aggregate functions to different columns</h2><p id="d560">It is possible to apply different aggregate functions to different columns in the group by function. We can pass a dictionary to indicate which functions will be applied to which columns.</p><div id="c8c3"><pre>marketing<span class="hljs-selector-attr">[[<span class="hljs-string">'Married'</span>,<span class="hljs-string">'Salary'</span>,<span class="hljs-string">'AmountSpent'</span>]</span>]<span class="hljs-selector-class">.groupby</span>(<span class="hljs-selector-attr">[<span class="hljs-string">'Married'</span>]</span>)
<span class="hljs-selector-class">.agg</span>({<span class="hljs-string">'Salary'</span>:<span class="hljs-string">'mean'</span>, <span class="hljs-string">'AmountSpent'</span>:<span class="hljs-string">'sum'</span>})</pre></div><figure id="dfb4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*pYzPuvSA5L-pE3GoypOLwQ.png"><figcaption>(image by author)</figcaption></figure><p id="d120">We have calculated the average salary and total spent amount by each group in the married column. However, it would be better if we also somehow indicate which functions are applied to each column.</p><p id="63f4">The solution is the NamedAgg method.</p><h2 id="c2df">28. NamedAgg in group by</h2><p id="d816">We will do the same operation as in the previous example but only change the column names in the result.</p><div id="b526"><pre>marketing<span class="hljs-selector-attr">[[<span class="hljs-string">'Married'</span>,<span class="hljs-string">'Salary'</span>,<span class="hljs-string">'AmountSpent'</span>]</span>]<span class="hljs-selector-class">.groupby</span>(<span class="hljs-selector-attr">[<span class="hljs-string">'Married'</span>]</span>)
<span class="hljs-selector-class">.agg</span>( Average_salary = pd<span class="hljs-selector-class">.NamedAgg</span>(<span class="hljs-string">'Salary'</span>, <span class="hljs-string">'mean'</span>), Total_spent = pd<span class="hljs-selector-class">.NamedAgg</span>(<span class="hljs-string">'AmountSpent'</span>, <span class="hljs-string">'sum'</span>) )</pre></div><figure id="cd21"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ax25V7wpgKP8VB3I0Yfv_w.png"><figcaption>(image by author)</figcaption></figure><h2 id="bdca">29. Crosstab function</h2><p id="28ff">The cross tab function is used to create a cross table based on specified columns, values, and aggregate functions. It is similar to a pivot table.</p><p id="6652">For instance, we can calculate the average salary of cross categories between the age and gender columns.</p><div id="beff"><pre>pd.crosstab(<span class="hljs-attribute">index</span>=marketing.Age, <span class="hljs-attribute">columns</span>=marketing.Gender, <span class="hljs-attribute">values</span>=marketing.Salary, <span class="hljs-attribute">aggfunc</span>=<span class="hljs-string">'mean'</span>).round(1)</pre></div><figure id="8d26"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*O-PyKlYeAkSYrERRBYoYhA.png"><figcaption>(image by author)</figcaption></figure><p id="5534">The middle aged males have the highest average salary.</p><h2 id="dfa6">30. Crosstab function — 2</h2><p id="45cd">We will do a slightly more complex example with the crosstab function. 
We can pass multiple columns and also display the overall values.</p><div id="6be5"><pre>pd.crosstab(index=[marketing.Age, marketing.Married], <span class="hljs-attribute">columns</span>=marketing.Gender,values=marketing.Salary, <span class="hljs-attribute">aggfunc</span>=<span class="hljs-string">'mean'</span>, <span class="hljs-attribute">margins</span>=<span class="hljs-literal">True</span>).round(1)</pre></div><figure id="a610"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*SdqRMcHS1aze5VvqUwRrJA.png"><figcaption>(image by author)</figcaption></figure><p id="a70d">This cross table is more informative than the previous one as it includes more specific categories and overall average values.</p><h2 id="8c5c">31. Pivot_table function</h2><p id="55e8">It is extremely similar to the crosstable function with a few small differences in the syntax. I will create the same table as in the previous example using the pivot_table function.</p><div id="4113"><pre>pd.pivot_table(<span class="hljs-attribute">data</span>=marketing, index=[<span class="hljs-string">'Age'</span>, <span class="hljs-string">'Married'</span>], <span class="hljs-attribute">columns</span>=<span class="hljs-string">'Gender'</span>, <span class="hljs-attribute">values</span>=<span class="hljs-string">'Salary'</span>, <span class="hljs-attribute">aggfunc</span>=<span class="hljs-string">'mean'</span>, <span class="hljs-attribute">margins</span>=<span class="hljs-literal">True</span>).round(1)</pre></div><figure id="e768"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*oxAnPL7VvX6EHf9ILWhtjQ.png"><figcaption>(image by author)</figcaption></figure><p id="c2c7">We can pass the dataframe to the data parameter and use the column names as strings.</p><h2 id="49df">32. Splitting strings</h2><p id="6d4d">The string accessor can be used to split or combine strings. 
For instance, we can split the parts of the date in the groceries dataframe to obtain day, month, and year values.</p><p id="15c9">Please note that the data type should be object or string to be able to apply the str accessor.</p><p id="98a1"><b>Note</b>: If you have the date column stored with the datetime64[ns] data type, convert it back to the “string” data type in order to apply the following splitting operation. Converting it to “object” is not enough, because the values then remain Timestamp objects rather than strings, and the str accessor cannot be used on them. However, if you have it stored as “object” or “string” in the first place, you can apply the str accessor.</p><div id="9704"><pre>groceries<span class="hljs-selector-attr">[<span class="hljs-string">'month'</span>]</span> = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span>
<span class="hljs-selector-class">.str</span><span class="hljs-selector-class">.split</span>(<span class="hljs-string">'-'</span>, expand=True)<span class="hljs-selector-attr">[1]</span></pre></div><p id="0d71">We have split the date column at the “-” character. The expand parameter is set to True to create a separate column for each part. We have selected the second column ([1]), which is the month.</p><figure id="388e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ufD_Vc5cIAd_68uX6y2H5A.png"><figcaption>(image by author)</figcaption></figure><h2 id="a232">33. Splitting strings on character level</h2><p id="cd11">We can select part of a string based on the position of its characters. Consider the previous example. We may want to retrieve the last two characters of the years (e.g. 15 instead of 2015).</p><p id="3dfb">The str accessor allows indexing on strings.</p><div id="74ac"><pre>groceries<span class="hljs-selector-attr">[<span class="hljs-string">'year'</span>]</span> = groceries<span class="hljs-selector-attr">[<span class="hljs-string">'Date'</span>]</span>
<span class="hljs-selector-class">.str</span><span class="hljs-selector-class">.split</span>(<span class="hljs-string">'-'</span>, expand=True)<span class="hljs-selector-attr">[2]</span><span class="hljs-selector-class">.str</span><span class="hljs-selector-attr">[-2:]</span></pre></div><figure id="13d9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*4g0JUOlXtRRTEScaTH27pQ.png"><figcaption>(image by author)</figcaption></figure><h2 id="151a">34. Sidetable</h2><p id="89a2"><a href="https://github.com/chris1610/sidetable">Sidetable</a> is an add-on for Pandas which makes it easier to create summaries of dataframes. It can be considered as a combination of value counts and cross tab functions.</p><p id="b7a0">Once installed, it can be used as other accessors such as str and dt.</p><div id="de63"><pre><span class="hljs-title">pip</span> install sidetable <span class="hljs-keyword">import</span> sidetable</pre></div><div id="9993"><pre>groceries.stb.fre<span class="hljs-string">q(['itemDescription'], thresh=25)</span></pre></div><figure id="c51d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*0oHke4-1SpMpLoBBVf5HtA.png"><figcaption>(image by author)</figcaption></figure><p id="ea51">Freq function returns a dataframe that conveys 3 pieces of information.</p><ul><li>The number of observations (i.e. rows) for each category (value_counts()).</li><li>The percentage of each category in the entire column (value_counts(normalize=True)).</li><li>The cumulative versions of the two above.</li></ul><p id="b3ba">Sidetable offers more functionality. I wrote a detailed <a href="https://towardsdatascience.com/pandas-sidetable-a-smarter-way-of-using-pandas-96fa7c61e514">article</a> about sidetable if you’d like to read further.</p><h2 id="e7d6">35. 
Finding missing values</h2><p id="09ea">Missing values need to be handled very carefully in order to produce an accurate and robust analysis.</p><p id="28b8">The isna function can be used to find the missing values in a dataframe. It returns true if the value is missing. Thus, we can count the total number of missing values by applying the sum function.</p><div id="ffab"><pre>groceries<span class="hljs-selector-class">.isna</span>()<span class="hljs-selector-class">.sum</span>()</pre></div><div id="446d"><pre><span class="hljs-attribute">Member_number</span> <span class="hljs-number">0</span>
<span class="hljs-attribute">Date</span> <span class="hljs-number">0</span>
<span class="hljs-attribute">itemDescription</span> <span class="hljs-number">0</span></pre></div><p id="08cb">We do not have any missing values in the groceries dataframe.</p><h2 id="4531">36. Handling missing values</h2><p id="1667">The fillna function can be used to handle missing values. It provides many options to fill missing values such as mean, median, or a constant value.</p><p id="8436">We can also use the previous or next value to fill a missing value.</p><p id="316b">Let us first set a few values in the groceries dataframe to missing.</p><div id="ce54"><pre>groceries.iloc<span class="hljs-string">[[1,10,30], [1,2]]</span> = np.nan</pre></div><div id="9c22"><pre><span class="hljs-attribute">groceries</span>.isna().sum()
<span class="hljs-attribute">Member_number</span> <span class="hljs-number">0</span>
<span class="hljs-attribute">Date</span> <span class="hljs-number">3</span>
<span class="hljs-attribute">itemDescription</span> <span class="hljs-number">3</span></pre></div><p id="30bd">We can use the most frequent item to fill missing values in the item description column. For the date column, we will use the previous value to replace a missing value. In both cases the result is assigned back to the column, which avoids relying on chained in-place modification that newer pandas versions warn about.</p><div id="4ac8"><pre>groceries[<span class="hljs-string">'itemDescription'</span>] = groceries[<span class="hljs-string">'itemDescription'</span>].fillna(groceries[<span class="hljs-string">'itemDescription'</span>].mode()[0])</pre></div><div id="bae8"><pre>groceries[<span class="hljs-string">'Date'</span>] = groceries[<span class="hljs-string">'Date'</span>].ffill()</pre></div><div id="0fa2"><pre><span class="hljs-attribute">groceries</span>.isna().sum()
<span class="hljs-attribute">Member_number</span> <span class="hljs-number">0</span>
<span class="hljs-attribute">Date</span> <span class="hljs-number">0</span>
<span class="hljs-attribute">itemDescription</span> <span class="hljs-number">0</span></pre></div><h2 id="c0b4">37. Selecting data types</h2><p id="52b6">The select_dtypes function can be used to select the columns that do or do not belong to a particular data type.</p><div id="6fd6"><pre>marketing<span class="hljs-selector-class">.select_dtypes</span>(include=<span class="hljs-string">'object'</span>)<span class="hljs-selector-class">.columns</span> <span class="hljs-function"><span class="hljs-title">Index</span><span class="hljs-params">([<span class="hljs-string">'Age'</span>, <span class="hljs-string">'Gender'</span>, <span class="hljs-string">'OwnHome'</span>, <span class="hljs-string">'Married'</span>, <span class="hljs-string">'Location'</span>, <span class="hljs-string">'History'</span>], dtype=<span class="hljs-string">'object'</span>)</span></span></pre></div><div id="1954"><pre>marketing<span class="hljs-selector-class">.select_dtypes</span>(exclude=<span class="hljs-string">'object'</span>)<span class="hljs-selector-class">.columns</span> <span class="hljs-function"><span class="hljs-title">Index</span><span class="hljs-params">([<span class="hljs-string">'Salary'</span>, <span class="hljs-string">'Children'</span>, <span 
class="hljs-string">'Catalogs'</span>, <span class="hljs-string">'AmountSpent'</span>], dtype=<span class="hljs-string">'object'</span>)</span></span></pre></div><p id="2290">We can include or exclude certain data types.</p><h2 id="eff9">38. Creating a dataframe</h2><p id="2d43">The DataFrame function can be used to create a dataframe. A dictionary can be passed to the DataFrame function. The keys will be the column names and the values will represent the row values.</p><p id="2629">Let’s create a dataframe that contains the prices of the items in the groceries dataframe.</p><div id="ccc6"><pre><span class="hljs-attr">unique_items</span> = groceries.itemDescription.unique()</pre></div><div id="edde"><pre>prices = pd<span class="hljs-selector-class">.DataFrame</span>({ <span class="hljs-string">'itemDescription'</span>: unique_items, <span class="hljs-string">'prices'</span>:np<span class="hljs-selector-class">.random</span><span class="hljs-selector-class">.randint</span>(<span class="hljs-number">10</span>, size=<span class="hljs-built_in">len</span>(unique_items)) })</pre></div><figure id="100a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*kngZBetwwzSI_A22D4_8Og.png"><figcaption>(image by author)</figcaption></figure><p id="99b5">We assign the prices randomly by creating a numpy array of random integers between 0 and 10.</p><h2 id="9887">39. Merging dataframes</h2><p id="3034">The merge function can be used to merge two dataframes based on a shared column or columns. For instance, we can merge the groceries and price dataframes based on the item description column.</p><div id="49c3"><pre><span class="hljs-attr">merged_df</span> = groceries.merge(prices, <span class="hljs-literal">on</span>=<span class="hljs-string">'itemDescription'</span>)</pre></div><figure id="8b14"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Y4bmsaQey96DDwuu0hwIxw.png"><figcaption>(image by author)</figcaption></figure><h2 id="b442">40. 
Correlations</h2><p id="cb0d">When working on a machine learning task, the correlations between numerical variables need to be taken into consideration.</p><p id="3ae7">The corr function calculates the correlations and returns a matrix that contains correlation coefficients between variables.</p><figure id="55e1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*WW1OJ2atjcbjhEMNRDO2cw.png"><figcaption>(image by author)</figcaption></figure><p id="1f17">As we can see, the salary and the spent amount are highly correlated.</p><h1 id="5fb8">Conclusion</h1><p id="871e">In this article and the <a href="https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4">previous</a> one, we have covered a great many of the functions and methods of Pandas.</p><p id="0f67">As you keep using pandas for your data analysis tasks, you may discover new functions and methods. As with any other subject, practice makes perfect.</p><p id="633f">Thank you for reading. Please let me know if you have any feedback.</p></article></body>

40 Examples to Master Pandas

A comprehensive practical guide

Pandas is one of the most widely used data analysis and manipulation libraries. It provides numerous functions and methods to clean, process, manipulate, and analyze data.

The best way to get comfortable working with Pandas is through practice. I previously wrote a practical guide that contains 30 examples.

In this article, I extend that set of examples to cover a broader scope. The 40 examples here include not only the basic functions and techniques but also some more unusual cases.

Most of the examples use functions and methods that were not discussed in the previous article. The few examples that cover the same functions are the ones that I want to emphasize and explain again with a different example.

We will use a marketing data set and a grocery data set in the examples. The first example is reading the csv files into Pandas dataframes.

1. Reading csv files

The read_csv function provides flexible ways for reading csv files into Pandas dataframes.

import numpy as np
import pandas as pd

marketing = pd.read_csv("/content/DirectMarketing.csv")
groceries = pd.read_csv("/content/Groceries_dataset.csv")

The first five rows of the marketing dataframe (image by author)

The first five rows of the groceries dataframe (image by author)

2. Changing data type with astype

The dates need to be stored in the datetime data type in order to use the datetime functions of Pandas. Let’s check the data type of the columns of the groceries dataframe.

groceries.dtypes

Member_number       int64
Date               object
itemDescription    object

As you can see, the data type of the date column is object. We can change it using the astype function.

groceries['Date'] = groceries['Date'].astype("datetime64[ns]")

groceries.dtypes

Member_number               int64
Date               datetime64[ns]
itemDescription            object
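The same conversion can be tried on a tiny dataframe without the csv files. This is only a sketch: the sample values below are made up for illustration, and ISO-formatted strings are used because they are unambiguous to parse.

```python
import pandas as pd

# Hypothetical sample data (not taken from the groceries file).
sample = pd.DataFrame({"Date": ["2015-07-21", "2015-01-05", "2014-11-30"]})
print(sample["Date"].dtype)  # a text dtype before the conversion

# Spell out the unit; a bare "datetime64" is rejected by some pandas versions.
sample["Date"] = sample["Date"].astype("datetime64[ns]")
print(sample["Date"].dtype)

# Once the column is datetime, the dt accessor becomes available.
print(sample["Date"].dt.year.tolist())
```

After the conversion, the column can be used with the datetime functions of Pandas.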

3. Changing the data type with to_datetime

We can also use the to_datetime function to assign appropriate data types for dates. The syntax is a little different from that of the astype function.

groceries['Date'] = pd.to_datetime(groceries['Date'])
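When the day comes before the month in the raw strings, it can be safer to state the format explicitly instead of letting Pandas infer it. The sketch below assumes day-first strings such as "21-07-2015"; the sample values are made up, so check the actual layout of your own file first.

```python
import pandas as pd

# Hypothetical day-first date strings (dd-mm-yyyy).
raw = pd.Series(["21-07-2015", "05-01-2015", "30-11-2014"])

# An explicit format removes any guessing about the day/month order.
dates = pd.to_datetime(raw, format="%d-%m-%Y")

print(dates.dt.month.tolist())  # months, not days: [7, 1, 11]
```

Without the format argument, a value like "05-01-2015" could be read as either the 5th of January or the 1st of May, depending on how the parser interprets it.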

4. Parsing dates

In the first example, I mentioned that the read_csv function is quite flexible when reading csv files. It can also handle dates. We can assign appropriate data types for dates while reading the data, which saves us from having to change the data type later on.

groceries = pd.read_csv("/content/Groceries_dataset.csv", parse_dates=['Date'])

groceries.dtypes

Member_number               int64
Date               datetime64[ns]
itemDescription            object
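To try date parsing without a file on disk, we can read from an in-memory string. This is a minimal sketch: the two rows are invented, and the dayfirst flag is included on the assumption that the dates are written day-first.

```python
import io
import pandas as pd

# A made-up two-row file with the same column names as the groceries data.
csv_text = """Member_number,Date,itemDescription
1808,21-07-2015,tropical fruit
2552,05-01-2015,whole milk
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],  # convert this column while reading
    dayfirst=True,         # assumption: strings are dd-mm-yyyy
)

print(sample.dtypes)
print(sample["Date"].dt.month.tolist())
```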

5. Filtering with the isin method

There are many ways to filter a dataframe based on the values. We can use logical operators such as equal (==), not equal (!=), or greater than (>).

The isin method allows us to filter based on a specific set of values. We can just pass a list of the values we want to keep.

groceries[groceries.Member_number.isin([3737, 2433, 3915, 2625])].shape

(126, 3)

There are 126 entries that belong to the customers whose member number is given in the list.

6. Tilde operator

The tilde (~) operator can be used as “not” while applying filters. For instance, we can find the complement of the filtered rows in the previous example by just adding the tilde operator at the beginning.

groceries[~groceries.Member_number.isin([3737, 2433, 3915, 2625])].shape

(38639, 3)
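Since the tilde simply inverts the boolean mask, the two filters split the dataframe into two disjoint parts, so the two counts should add up to the total number of rows (126 + 38639 = 38765 in the example above). A small sketch on made-up member numbers illustrates this:

```python
import pandas as pd

# Toy data (hypothetical member numbers).
toy = pd.DataFrame({"Member_number": [3737, 1000, 2433, 1000, 4000]})

# Build the boolean mask once and reuse it.
selected = toy["Member_number"].isin([3737, 2433])

inside = toy[selected]    # rows whose member number is in the list
outside = toy[~selected]  # the complement

print(len(inside), len(outside), len(toy))
```

Storing the mask in a variable also keeps the filter readable when the list of values gets long.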

7. Value counts with normalization

The value_counts function is one of the most frequently used functions. It counts the number of occurrences of each value and returns a series. If it is used with the normalize parameter, it returns the share of each value as a fraction of the total instead of the raw counts.

marketing.Catalogs.value_counts(normalize=True)

12    0.282
6     0.252
24    0.233
18    0.233

The most frequent value in the catalogs column is 12, which accounts for about 28 percent of the entire column.
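The normalized values are fractions that sum to 1, so multiplying by 100 gives percentages. A minimal sketch with made-up catalog counts:

```python
import pandas as pd

# Toy series (hypothetical values, not the real marketing data).
catalogs = pd.Series([12, 12, 6, 24, 18, 12, 6, 24])

shares = catalogs.value_counts(normalize=True)  # fractions of the total
percent = (shares * 100).round(1)               # the same information in percent

print(percent)
```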

8. Setting a column as index

Pandas assigns an integer index to dataframes by default, but we can change it to any column using the set_index function.

For instance, we can set the date column as the index of the groceries dataframe.

groceries.set_index('Date', inplace=True)

9. Resetting the index

When some rows are dropped, Pandas does not automatically reset the index. Similarly, when two dataframes are concatenated, the indices will not be reset. In such cases, the new dataframes will not have consecutive index values.

We can use the reset_index function in those cases. I have dropped some rows of the groceries dataframe:

As you can see, some indices are skipped. We can now use the reset_index function.

groceries.reset_index(drop=True, inplace=True)

The order of the values is the same but the index is reset. The drop parameter is important. If we do not set it to True, the old index will be kept as a new column in the dataframe. The inplace parameter ensures the changes are saved.
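The whole sequence can be reproduced on a tiny, made-up dataframe (the column name and values are placeholders):

```python
import pandas as pd

# Toy dataframe with the default index 0..4.
toy = pd.DataFrame({"item": ["a", "b", "c", "d", "e"]})

# Dropping rows leaves gaps in the index.
toy = toy.drop([1, 3])
print(toy.index.tolist())

# drop=True discards the old index instead of keeping it as a column.
toy = toy.reset_index(drop=True)
print(toy.index.tolist())
```

Assigning the result back is an alternative to inplace=True and has the same effect here.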

10. The unique values

The unique function returns an array of the unique values in a column.

groceries['itemDescription'].unique()[:5]

array(['tropical fruit', 'whole milk', 'pip fruit', 'other vegetables','rolls/buns'], dtype=object)

I have only displayed the first 5 elements for demonstration purposes.

11. The number of unique values

If we are only interested in the number of unique values, we can use the nunique function. It can be called on the entire dataframe or a particular column.

groceries.nunique()

Member_number      3898
Date                728
itemDescription     167

There are other ways to count the number of unique values in a column. For instance, the length of the array returned by the unique function gives us the number of unique values.
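One detail to keep in mind is that the two approaches treat missing values differently: nunique ignores them by default, whereas the array returned by unique includes them. A short sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Toy series with one missing value.
items = pd.Series(["milk", "bread", "milk", np.nan])

n_default = items.nunique()                # missing value is ignored
n_from_unique = len(items.unique())        # missing value is counted
n_with_nan = items.nunique(dropna=False)   # matches len(unique())

print(n_default, n_from_unique, n_with_nan)
```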

11. Creating a random sample of larger size

The sample function can be used to create a random sample of the rows of a dataframe. It comes in handy when working with unbalanced datasets in machine learning.

By default, a sample cannot be larger than the original dataframe. To create a larger one, the replace parameter must be set to True, which allows the same row to be used more than once.

Let us create a random sample of the marketing dataframe by only using the rows with a spent amount of less than 300.

less = marketing[marketing.AmountSpent < 300].sample(n=400, replace=True)

less.shape
(400, 10)

12. Combining dataframes

We can concatenate dataframes horizontally or vertically with the concat function. The axis parameter is used to determine the axis through which the concatenation occurs.

We can concatenate the marketing and sample dataframes we created in the previous examples.

less.shape, marketing.shape
((400, 10), (1000, 10))

new = pd.concat([marketing, less])

new.shape
(1400, 10)

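The default value of the axis parameter is 0, which means concatenating along the index (i.e. stacking rows). The dataframes should have the same columns; with the default settings, any column that is missing from one of them is filled with missing values in the result.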
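Note that concat keeps the original row labels, so the combined dataframe can contain duplicate index values. The sketch below, on made-up data, shows this and how the ignore_index parameter avoids it:

```python
import pandas as pd

# Toy stand-in for the marketing dataframe (hypothetical values).
toy = pd.DataFrame({"AmountSpent": [100, 250, 900], "Salary": [30, 40, 90]})

# Oversample the low spenders; replace=True allows n to exceed the row count.
extra = toy[toy["AmountSpent"] < 300].sample(n=4, replace=True, random_state=0)

stacked = pd.concat([toy, extra])                   # keeps the old labels
clean = pd.concat([toy, extra], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked.index.is_unique, clean.index.is_unique)
```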

13. Selecting a range of rows and columns by index

We can select a range of rows and columns by using the iloc function. It accepts the integer positions of the desired rows and columns. Every pandas dataframe has integer positions for both rows and columns, regardless of their labels.

For instance, we can select the first 4 rows and the first 3 columns as follows:
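A minimal sketch of such a selection on a small, made-up dataframe (the column names are placeholders):

```python
import numpy as np
import pandas as pd

# Toy dataframe with 6 rows and 5 columns.
toy = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list("abcde"))

# Rows 0..3 and columns 0..2; the end of each range is excluded.
subset = toy.iloc[:4, :3]

print(subset)
```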

14. Selecting specific rows and columns by index

The iloc function also accepts an array of values instead of ranges. We can pass a list or numpy array.

toselect = np.random.randint(100, size=7)

marketing.iloc[toselect, [2,4,6]]

We have created a numpy array of 7 random integers from 0 to 99 (the upper bound of 100 is excluded). We have passed this array to the iloc function along with a list of 3 columns to be selected.

15. Selecting rows and columns by label

The loc function is just like the iloc function but it accepts labels instead of indices.

I have replicated the selection in the previous example to point out the difference between the loc and iloc functions. You may have noticed that the same array is used for the row part. The reason is that the labels and indices of the rows are the same unless we assign different labels for rows.
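The comparison can be sketched on made-up data as follows (the values are placeholders, and the column names only mimic the marketing dataframe):

```python
import pandas as pd

# Toy dataframe with the default integer row labels.
toy = pd.DataFrame({
    "Age": ["Old", "Young", "Middle", "Young"],
    "Salary": [47500, 13500, 63600, 85600],
    "AmountSpent": [755, 296, 1304, 2436],
})

rows = [0, 2]
by_position = toy.iloc[rows, [1, 2]]                  # column positions
by_label = toy.loc[rows, ["Salary", "AmountSpent"]]   # column labels

print(by_position.equals(by_label))

# One difference to remember: loc slices include the end label,
# whereas iloc slices exclude the end position.
print(len(toy.loc[0:2]), len(toy.iloc[0:2]))
```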

16. Extracting the year and month from dates

Pandas provides lots of functions to operate on the dates. They are used through the dt accessor.

We can easily extract the year and month from dates as follows:

groceries['Year'] = groceries['Date'].dt.year
groceries['Month'] = groceries['Date'].dt.month

17. Dropping columns and rows

In the previous example, we created two new columns. By default, pandas adds new columns at the end of a dataframe, but we can change that.

We will add the new columns at a specific position in the next example. However, we first need to drop them which can be done by using the drop function.

groceries.drop(['Year','Month'], axis=1, inplace=True)

We pass the list of columns or rows to be dropped. The axis parameter needs to be 1 to drop columns and 0 to drop rows.

18. Inserting a column

The year and month columns might look better if they are placed before the date column. We can use the insert function to accomplish this task.

year = groceries['Date'].dt.year
month = groceries['Date'].dt.month

groceries.insert(1, 'Month', month)
groceries.insert(2, 'Year', year)
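Instead of hard-coding the positions, we could also look up the position of the date column first. This is only a sketch on a made-up dataframe that mimics the groceries columns:

```python
import pandas as pd

# Toy stand-in for the groceries dataframe (hypothetical rows).
toy = pd.DataFrame({
    "Member_number": [1808, 2552],
    "Date": pd.to_datetime(["2015-07-21", "2015-01-05"]),
    "itemDescription": ["tropical fruit", "whole milk"],
})

# Position of the Date column; new columns inserted here end up before it.
pos = toy.columns.get_loc("Date")

toy.insert(pos, "Year", toy["Date"].dt.year)
toy.insert(pos, "Month", toy["Date"].dt.month)  # pushes Year one place to the right

print(toy.columns.tolist())
```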

19. Replacing values

In the previous example, we created a month column that contains numbers to represent months. You may want this column to contain the names of months (i.e. January, February, and so on).

There are multiple ways to do this operation. I will first show you the harder way. In the next example, we will see a much simpler method.

We can use the replace function to replace integers with strings of month names.

month_names = {1:'January', 2:'February', 3:'March', 4:'April',
5: 'May', 6:'June', 7:'July', 8:'August', 9:'September',
10:'October', 11:'November', 12:'December'}

groceries['Month'] = groceries['Month'].replace(month_names)

We have created a dictionary that indicates the replacements and then passed it to the replace function.

20. Month name

There is a much simpler way of doing the task in the previous step. We will use a method of the dt accessor.

groceries['Month'] = groceries['Date'].dt.month_name()

We can directly retrieve the month name from the date column. It is important to note that the functions of dt accessor can only be used with datetime like values.

21. The cumulative sum

The cumsum function allows us to create a column based on the cumulative sum of another column. Consider the marketing dataframe. We can create a column that contains the cumulative sum of the spent amount.

marketing['CumAmountSpent'] = marketing['AmountSpent'].cumsum()

22. Filtering strings

In the previous examples, we mentioned the dt accessor, which makes it simple to deal with dates. Similarly, the str accessor provides many functions and methods that make it easier to process textual data.

For instance, we can check if strings contain a specific set of characters. A typical use case would be to count the number of rows that contain the word “milk” in the description column of the groceries table.

groceries.itemDescription.str.contains('milk').sum()
3186

groceries.itemDescription.str.contains('whole milk').sum()
2502

The contains function returns True if a value contains the given string. Since each True counts as 1, applying the sum function gives the total number of rows that contain the word “milk”.
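By default the match is case-sensitive, and missing values produce missing results rather than False. The case and na parameters control this behavior, as in the following sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Toy descriptions (hypothetical), including a capitalized and a missing value.
items = pd.Series(["whole milk", "Milk powder", "butter milk", np.nan, "rolls/buns"])

# na=False treats missing values as "no match".
strict = int(items.str.contains("milk", na=False).sum())

# case=False also matches "Milk".
relaxed = int(items.str.contains("milk", case=False, na=False).sum())

print(strict, relaxed)
```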

23. Filtering strings based on length

We can also filter strings based on their length (i.e. number of characters). Let us find the items with long descriptions.

groceries[groceries.itemDescription.str.len() > 20]\
.itemDescription.unique()

array(['fruit/vegetable juice', 'packaged fruit/vegetables',
       'frozen potato products', 'Instant food products',
       'female sanitary products', 'house keeping products',
       'chocolate marshmallow', 'long life bakery product',
       'flower soil/fertilizer', 'preservation products'], dtype=object)

The filtering is on the item description column and the descriptions that are longer than 20 characters are selected.

24. Plotting the distribution of a variable

Pandas is not a data visualization library so it is not optimized for visualization tasks. However, it provides plotting functions which I think make it highly convenient to produce basic plots.

For instance, we can create a kde plot to see the distribution of the salary column.

marketing.Salary.plot(kind='kde', title='Distribution of Salary',
figsize=(10,6))

25. Creating a histogram

Histograms are also commonly used to check the distribution of a numerical feature. We can use the plot function to produce histograms as well.

marketing.Salary.plot(kind='hist', title='Distribution of Salary',
figsize=(10,6))

26. Trend in the monthly sales

In this example, we will combine a few operations to create a plot that shows the trend in monthly sales. The first step is to create a month column as we did previously.

groceries['month_name'] = groceries['Date'].dt.month_name()

We will calculate the number of items sold in each month by using the group by function and then plot the values.

groceries[['month_name','Date']].groupby('month_name')\
.count().plot(title="Monthly Sales", figsize=(10,6))

You may have noticed that we did not use the kind parameter of the plot function. The reason is that the default value of kind parameter produces a line plot which is what we need in our case.
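One caveat is that grouping by the month name sorts the groups alphabetically (April, August, and so on), so the line does not run from January to December. A possible workaround, sketched here on made-up dates without the plotting step, is to group by the month number instead:

```python
import pandas as pd

# Toy purchase dates (hypothetical).
toy = pd.DataFrame({"Date": pd.to_datetime([
    "2015-01-03", "2015-01-20", "2015-02-11",
    "2015-04-02", "2015-04-09", "2015-04-30",
])})
toy["month_name"] = toy["Date"].dt.month_name()

# Grouping by name: groups come out in alphabetical order.
alphabetical = toy.groupby("month_name")["Date"].count()

# Grouping by month number: groups come out in calendar order.
month_number = toy["Date"].dt.month.rename("month")
chronological = toy.groupby(month_number)["Date"].count()

print(alphabetical.index.tolist())
print(chronological.index.tolist())
```

Calling plot on the chronological series would then give a line that follows the calendar.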

27. Different aggregate functions to different columns

It is possible to apply different aggregate functions to different columns in the group by function. We can pass a dictionary to indicate which functions will be applied to which columns.

marketing[['Married','Salary','AmountSpent']].groupby(['Married'])\
.agg({'Salary':'mean', 'AmountSpent':'sum'})

We have calculated the average salary and total spent amount for each group in the married column. However, it would be better if the column names in the result also indicated which function was applied to each column.

The solution is the NamedAgg method.

28. NamedAgg in group by

We will do the same operation as in the previous example but only change the column names in the result.

marketing[['Married','Salary','AmountSpent']].groupby(['Married'])\
.agg(
    Average_salary = pd.NamedAgg('Salary', 'mean'),
    Total_spent = pd.NamedAgg('AmountSpent', 'sum')
)
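Pandas also accepts plain (column, function) tuples as a shorthand for pd.NamedAgg. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for the marketing dataframe (hypothetical values).
toy = pd.DataFrame({
    "Married": ["Single", "Married", "Single", "Married"],
    "Salary": [20, 60, 40, 80],
    "AmountSpent": [100, 500, 300, 700],
})

# Each keyword becomes a column name; the tuple is (source column, function).
summary = toy.groupby("Married").agg(
    Average_salary=("Salary", "mean"),
    Total_spent=("AmountSpent", "sum"),
)

print(summary)
```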

29. Crosstab function

The cross tab function is used to create a cross table based on specified columns, values, and aggregate functions. It is similar to a pivot table.

For instance, we can calculate the average salary of cross categories between the age and gender columns.

pd.crosstab(index=marketing.Age, columns=marketing.Gender, values=marketing.Salary, aggfunc='mean').round(1)

The middle aged males have the highest average salary.

30. Crosstab function — 2

We will do a slightly more complex example with the crosstab function. We can pass multiple columns and also display the overall values.

pd.crosstab(index=[marketing.Age, marketing.Married], columns=marketing.Gender,values=marketing.Salary, aggfunc='mean',
margins=True).round(1)

This cross table is more informative than the previous one as it includes more specific categories and overall average values.

31. Pivot_table function

The pivot_table function is very similar to the crosstab function, with a few small differences in the syntax. I will create the same table as in the previous example using the pivot_table function.

pd.pivot_table(data=marketing, index=['Age', 'Married'], columns='Gender', values='Salary', aggfunc='mean',
margins=True).round(1)

We can pass the dataframe to the data parameter and use the column names as strings.
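To see the two functions side by side, here is a small sketch on made-up data. It leaves out the margins and the second index column to keep the comparison simple:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the marketing dataframe (hypothetical values).
toy = pd.DataFrame({
    "Age": ["Young", "Young", "Old", "Old", "Young", "Old"],
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Salary": [10, 20, 30, 40, 30, 60],
})

# crosstab takes the columns themselves (series).
ct = pd.crosstab(index=toy["Age"], columns=toy["Gender"],
                 values=toy["Salary"], aggfunc="mean")

# pivot_table takes the dataframe plus column names as strings.
pt = pd.pivot_table(data=toy, index="Age", columns="Gender",
                    values="Salary", aggfunc="mean")

print(np.allclose(ct.values, pt.values))
```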

32. Splitting strings

The string accessor can be used to split or combine strings. For instance, we can split the parts of the date in the groceries dataframe to obtain day, month, and year values.

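Please note that the data type should be object or string to be able to apply the str accessor.

Note: If the date column is stored with the datetime64[ns] data type, convert it back to the “string” data type in order to apply the following splitting operation. Converting it to “object” does not help, because the values then remain Timestamp objects rather than text, and the str accessor only works on string values. Also keep in mind that the text produced from a datetime column may order the date parts differently from the original file, which changes the position of each part after splitting. If the column is stored as “object” or “string” in the first place, you can apply the str accessor directly.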

groceries['month'] = groceries['Date']\
.str.split('-', expand=True)[1]

We have split the date column at the “-” character. The expand parameter is set to True to create a separate column for each part. We have selected the second column ([1]), which is the month.

33. Splitting strings on character level

We can select part of strings based on the position of characters. Consider the previous example. We may want to retrieve the last two characters of the years (e.g. 15 instead of 2015).

The str accessor allows indexing on strings.

groceries['year'] = groceries['Date']\
.str.split('-', expand=True)[2].str[-2:]
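The two splitting examples can be tried on a few made-up strings. The sketch assumes day-month-year text such as "21-07-2015", and it uses strftime to turn datetime values back into text in a chosen format:

```python
import pandas as pd

# Toy date strings (hypothetical), in day-month-year order.
dates = pd.Series(["21-07-2015", "05-01-2015"])

# expand=True returns one column per part; naming them makes the result clearer.
parts = dates.str.split("-", expand=True)
parts.columns = ["day", "month", "year"]

# Indexing through the str accessor keeps the last two characters.
short_year = parts["year"].str[-2:]

# If the column is already datetime, strftime produces text in a known format.
as_text = pd.to_datetime(dates, format="%d-%m-%Y").dt.strftime("%d-%m-%Y")

print(parts)
print(short_year.tolist())
```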

34. Sidetable

Sidetable is an add-on for Pandas which makes it easier to create summaries of dataframes. It can be considered as a combination of value counts and cross tab functions.

Once installed, it can be used as other accessors such as str and dt.

pip install sidetable
import sidetable

groceries.stb.freq(['itemDescription'], thresh=25)

The freq function returns a dataframe that conveys 3 pieces of information:

The number of observations (i.e. rows) for each category (value_counts()).
The percentage of each category in the entire column (value_counts(normalize=True)).
The cumulative versions of the two above.

Sidetable offers more functionality. I wrote a detailed article about sidetable if you’d like to read further.

35. Finding missing values

Missing values need to be handled very carefully in order to produce an accurate and robust analysis.

The isna function can be used to find the missing values in a dataframe. It returns True if a value is missing. Thus, we can count the total number of missing values in each column by applying the sum function.

groceries.isna().sum()

Member_number      0
Date               0
itemDescription    0

We do not have any missing values in the groceries dataframe.

36. Handling missing values

The fillna function can be used to handle missing values. It provides many options to fill missing values such as mean, median, or a constant value.

We can also use the previous or next value to fill a missing value.

Let us first set a few values in the groceries dataframe to missing.

groceries.iloc[[1,10,30], [1,2]] = np.nan

groceries.isna().sum()
Member_number      0
Date               3
itemDescription    3

We can use the most frequent item to fill missing values in the item description column. For the date column, we will use the previous value to replace a missing value.

groceries['itemDescription'] = groceries['itemDescription']\
.fillna(value=groceries['itemDescription'].mode()[0])

groceries['Date'] = groceries['Date'].ffill()

groceries.isna().sum()
Member_number      0
Date               0
itemDescription    0
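The other options mentioned above work in the same way. Here is a minimal sketch on a made-up numerical series, filling with the mean and with the next value:

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical) with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Fill with the mean of the available values.
mean_filled = s.fillna(s.mean())

# Fill with the next valid value; the last entry has none, so it stays missing.
back_filled = s.bfill()

print(mean_filled.tolist())
print(back_filled.tolist())
```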

37. Selecting data types

The select_dtypes function can be used to select columns that belong or do not belong to a particular data type.

marketing.select_dtypes(include='object').columns
Index(['Age', 'Gender', 'OwnHome', 'Married', 'Location', 'History'], dtype='object')

marketing.select_dtypes(exclude='object').columns
Index(['Salary', 'Children', 'Catalogs', 'AmountSpent'], dtype='object')

We can include or exclude certain data types.

38. Creating a dataframe

The DataFrame function can be used to create a dataframe. A dictionary can be passed to the DataFrame function. The keys will be the column names and the values will represent the row values.

Let’s create a dataframe that contains the prices of the items in the groceries dataframe.

unique_items = groceries.itemDescription.unique()

prices = pd.DataFrame({
    'itemDescription': unique_items,
    'prices':np.random.randint(10, size=len(unique_items))
})

We assign the prices randomly by creating a numpy array of random integers from 0 to 9 (the upper bound of 10 is excluded).

39. Merging dataframes

The merge function can be used to merge two dataframes based on a shared column or columns. For instance, we can merge the groceries and price dataframes based on the item description column.

merged_df = groceries.merge(prices, on='itemDescription')
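By default, merge keeps only the rows whose key appears in both dataframes. The how and indicator parameters help when some items have no price, as in this sketch on made-up data:

```python
import pandas as pd

# Toy stand-ins for the groceries and prices dataframes (hypothetical values).
groceries_toy = pd.DataFrame({
    "Member_number": [1, 2, 3],
    "itemDescription": ["whole milk", "rolls/buns", "soda"],
})
prices_toy = pd.DataFrame({
    "itemDescription": ["whole milk", "rolls/buns"],
    "prices": [3, 1],
})

# Default (inner) merge: "soda" has no price, so its row is dropped.
inner = groceries_toy.merge(prices_toy, on="itemDescription")

# Left merge keeps every purchase; indicator=True adds a _merge column
# that shows where each row came from.
left = groceries_toy.merge(prices_toy, on="itemDescription",
                           how="left", indicator=True)

print(len(inner), len(left))
print(left)
```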

40. Correlations

When working on a machine learning task, the correlations between numerical variables need to be taken into consideration.

The corr function calculates the correlations and returns a matrix that contains correlation coefficients between variables.
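A minimal sketch on made-up data is given below. The values are placeholders chosen so that salary and spent amount move together, and the numerical columns are selected first because correlations are only defined for numbers:

```python
import pandas as pd

# Toy stand-in for the marketing dataframe (hypothetical values).
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Salary": [20, 40, 60, 80, 100],
    "Children": [2, 0, 1, 3, 0],
    "AmountSpent": [150, 420, 580, 830, 990],
})

# Keep the numerical columns, then compute pairwise correlations.
corr_matrix = toy.select_dtypes(include="number").corr()

print(corr_matrix.round(2))
```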

As we can see, the salary and spent amount are highly correlated.

Conclusion

In this article and the previous one, we have covered a great deal of the functions and methods of Pandas.

As you keep using pandas for your data analysis tasks, you may discover new functions and methods. As with any other subject, practice makes perfect.

Thank you for reading. Please let me know if you have any feedback.