avatarLiu Zuo Lin

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

13860

Abstract

ing">'science'</span>, <span class="hljs-string">'history'</span>, <span class="hljs-string">'geography'</span>])</pre></div><p id="6a25">A student is eligible for math olympiad if both math and science scores are <i>above </i>80. Create a new column <code>olympiad</code> containing boolean values. True means the student is eligible for the math olympiad, and False means that the student is not eligible.</p><figure id="a929"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*YG3ARBHV-kfRc2SRAlDiJw.png"><figcaption></figcaption></figure><h1 id="dc4b">** Questions</h1><h2 id="04d8">11) Finding the mean + median per subject</h2><div id="1e34"><pre>df = pd.DataFrame([ [<span class="hljs-number">60</span>, <span class="hljs-number">70</span>, <span class="hljs-number">72</span>, <span class="hljs-number">90</span>, <span class="hljs-number">74</span>], [<span class="hljs-number">76</span>, <span class="hljs-number">70</span>, <span class="hljs-number">80</span>, <span class="hljs-number">84</span>, <span class="hljs-number">62</span>], [<span class="hljs-number">92</span>, <span class="hljs-number">70</span>, <span class="hljs-number">64</span>, <span class="hljs-number">82</span>, <span class="hljs-number">94</span>], [<span class="hljs-number">88</span>, <span class="hljs-number">68</span>, <span class="hljs-number">98</span>, <span class="hljs-number">90</span>, <span class="hljs-number">100</span>], [<span class="hljs-number">86</span>, <span class="hljs-number">70</span>, <span class="hljs-number">78</span>, <span class="hljs-number">66</span>, <span class="hljs-number">96</span>], ], columns=[<span class="hljs-string">'english'</span>, <span class="hljs-string">'math'</span>, <span class="hljs-string">'science'</span>, <span class="hljs-string">'history'</span>, <span class="hljs-string">'geography'</span>])</pre></div><figure id="052a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*KJGs84VIlsGVx70wtfw6UA.png"><figcaption></figcaption></figure><h2 id="0267">12) Filling NaN with median of column</h2><div id="9f15"><pre>df = pd.DataFrame([ [<span class="hljs-number">60</span>, <span class="hljs-literal">None</span>, <span class="hljs-number">72</span>, <span class="hljs-number">90</span>, <span class="hljs-number">74</span>], [<span class="hljs-number">76</span>, <span class="hljs-number">70</span>, <span class="hljs-number">80</span>, <span class="hljs-literal">None</span>, <span class="hljs-number">62</span>], [<span class="hljs-number">92</span>, <span class="hljs-number">70</span>, <span class="hljs-number">64</span>, <span class="hljs-number">82</span>, <span class="hljs-number">94</span>], [<span class="hljs-literal">None</span>, <span class="hljs-number">68</span>, <span class="hljs-number">98</span>, <span class="hljs-number">90</span>, <span class="hljs-number">100</span>], [<span class="hljs-number">86</span>, <span class="hljs-number">70</span>, <span class="hljs-number">78</span>, <span class="hljs-number">66</span>, <span class="hljs-literal">None</span>], ], columns=[<span class="hljs-string">'english'</span>, <span class="hljs-string">'math'</span>, <span class="hljs-string">'science'</span>, <span class="hljs-string">'history'</span>, <span class="hljs-string">'geography'</span>])</pre></div><figure id="157e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*a2EJKJ5YZJT9GZIYKinlMw.png"><figcaption></figcaption></figure><h2 id="4cb0">13) Average (mean) price per shop</h2><div id="b07a"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-string">'apple'</span>, <span class="hljs-number">1.5</span>, <span class="hljs-number">20</span>], [<span class="hljs-string">'A'</span>, <span class="hljs-string">'orange'</span>, <span class="hljs-number">2.0</span>, <span class="hljs-number">30</span>], [<span class="hljs-string">'A'</span>, <span class="hljs-string">'pear'</span>, <span class="hljs-number">2.5</span>, <span class="hljs-number">10</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'apple'</span>, <span class="hljs-number">3.0</span>, <span class="hljs-number">8</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'orange'</span>, <span class="hljs-number">3.5</span>, <span class="hljs-number">20</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'pear'</span>, <span class="hljs-number">4.0</span>, <span class="hljs-number">10</span>], ], columns=[<span class="hljs-string">'shop'</span>, <span class="hljs-string">'fruit'</span>, <span class="hljs-string">'price'</span>, <span class="hljs-string">'quantity'</span>])</pre></div><figure id="fdaf"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*jPLxcYq0K8d3m86wupQ4sQ.png"><figcaption></figcaption></figure><h2 id="4be3">14) Average (median) price per fruit</h2><div id="5d23"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-string">'apple'</span>, <span class="hljs-number">1.5</span>, <span class="hljs-number">20</span>], [<span class="hljs-string">'A'</span>, <span class="hljs-string">'orange'</span>, <span class="hljs-number">2.0</span>, <span class="hljs-number">30</span>], [<span class="hljs-string">'A'</span>, <span class="hljs-string">'pear'</span>, <span class="hljs-number">2.5</span>, <span class="hljs-number">10</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'apple'</span>, <span class="hljs-number">3.0</span>, <span class="hljs-number">8</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'orange'</span>, <span class="hljs-number">3.5</span>, <span class="hljs-number">20</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'pear'</span>, <span class="hljs-number">4.0</span>, <span class="hljs-number">10</span>], ], columns=[<span class="hljs-string">'shop'</span>, <span class="hljs-string">'fruit'</span>, <span class="hljs-string">'price'</span>, <span class="hljs-string">'quantity'</span>])</pre></div><figure id="fe61"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*EkeY0fTmixRcp1BfDWRWLA.png"><figcaption></figcaption></figure><h2 id="e9ae">15) Area, Circumference, Volume and Surface Area</h2><div id="a449"><pre>df = pd.DataFrame([ [<span class="hljs-number">1</span>], [<span class="hljs-number">1.5</span>], [<span class="hljs-number">2</span>], [<span class="hljs-number">10</span>] ], columns=[<span class="hljs-string">'radius'</span>])</pre></div><p id="ff03">Where <code>radius</code> represents the radius of a circle/sphere, and:</p><ul><li>pi = 3.14159</li><li>area = pi * radius²</li><li>circumference = 2 * pi * radius</li><li>volume = 4 / 3 * pi * radius³</li><li>suface area = 4 * pi * radius²</li></ul><figure id="4145"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*6T1NGKFnGaRMnGCZTgcf6A.png"><figcaption></figcaption></figure><h2 id="22c4">16) Email feature extraction</h2><div id="8cef"><pre>df = pd.DataFrame([ [<span class="hljs-string">'[email protected]'</span>], [<span class="hljs-string">'[email protected]'</span>], [<span class="hljs-string">'[email protected]'</span>], [<span class="hljs-string">'[email protected]'</span>], ], columns=[<span class="hljs-string">'email'</span>])</pre></div><p id="aff0">Write some code to extract the following information</p><ul><li><code>name</code> — the stuff before the <code>@</code></li><li><code>host</code> — the stuff after the <code>@</code></li><li><code>tld</code> — the stuff after the last <code>.</code></li></ul><figure id="c50d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*px02cnpk0ab6znWrgddSjA.png"><figcaption></figcaption></figure><h2 id="39d2">17) Moving datetime column to index</h2><div id="7344"><pre>df = pd.DataFrame([ [<span class="hljs-string">'2022-01-03'</span>, <span class="hljs-number">100</span>], [<span class="hljs-string">'2022-01-02'</span>, <span class="hljs-number">110</span>], [<span class="hljs-string">'2022-01-01'</span>, <span class="hljs-number">120</span>], [<span class="hljs-string">'2022-01-08'</span>, <span class="hljs-number">130</span>], [<span class="hljs-string">'2022-01-05'</span>, <span class="hljs-number">120</span>], [<span class="hljs-string">'2022-01-06'</span>, <span class="hljs-number">140</span>], [<span class="hljs-string">'2022-01-07'</span>, <span class="hljs-number">150</span>], [<span class="hljs-string">'2022-01-04'</span>, <span class="hljs-number">120</span>], ], columns=[<span class="hljs-string">'date'</span>, <span class="hljs-string">'price'</span>])</pre></div><p id="91c2">Write some code to first convert the datetime column to actual datetime objects, then move it to the index of the dataframe. Remember to sort the values by date.</p><figure id="08a6"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*kx50iKFW-ilsCeITQBU-4A.png"><figcaption></figcaption></figure><h2 id="0173">18) Mean value per month</h2><div id="20f3"><pre>df = pd.DataFrame([ [<span class="hljs-string">'2022-01-03'</span>, <span class="hljs-number">100</span>], [<span class="hljs-string">'2022-01-02'</span>, <span class="hljs-number">110</span>], [<span class="hljs-string">'2022-01-01'</span>, <span class="hljs-number">120</span>], [<span class="hljs-string">'2022-02-08'</span>, <span class="hljs-number">130</span>], [<span class="hljs-string">'2022-02-05'</span>, <span class="hljs-number">120</span>], [<span class="hljs-string">'2022-05-06'</span>, <span class="hljs-number">140</span>], [<span class="hljs-string">'2022-05-07'</span>, <span class="hljs-number">150</span>], [<span class="hljs-string">'2022-05-04'</span>, <span class="hljs-number">120</span>], ], columns=[<span class="hljs-string">'date'</span>, <span class="hljs-string">'value'</span>])</pre></div><p id="e401">Here, the dates are in <code>yyyy-mm-dd</code> format. Find the average value per month.</p><figure id="436b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*NIUviY8Tk2x5gJlblXZV-g.png"><figcaption></figcaption></figure><h2 id="db18">19) Cleaning dirty numbers</h2><div id="fab6"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-string">'100,000'</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'80,000'</span>], [<span class="hljs-string">'C'</span>, <span class="hljs-string">'20,200'</span>], [<span class="hljs-string">'D'</span>, <span class="hljs-string">'50,000'</span>], [<span class="hljs-string">'E'</span>, <span class="hljs-string">'10,000'</span>], ], columns=[<span class="hljs-string">'item'</span>, <span class="hljs-string">'cost'</span>])</pre></div><p id="74cd">Here, the numbers have commas in them, causing pandas to think that they are strings. Convert each number string to an actual number.</p><figure id="97c2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*N79VC4Rf_InUZLVONtRcpg.png"><figcaption></figcaption></figure><h2 id="d425">20) Fixing duplicated entries</h2><div id="5b4f"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-number">20</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-number">25</span>], [<span class="hljs-string">'C'</span>, <span class="hljs-number">40</span>], [<span class="hljs-string">'A'</span>, <span class="hljs-number">22</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-number">1</span>], [<span class="hljs-string">'A'</span>, <span class="hljs-number">1</span>], ], columns=[<span class="hljs-string">'item'</span>, <span class="hljs-string">'quantity'</span>])</pre></div><p id="d243">Here, multiple items and their quantities are duplicated instead of combined. Write some code to combine them instead.</p><figure id="52fe"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*iOQ1fFlxylSCfXXGSRQhug.png"><figcaption></figcaption></figure><h1 id="2b81">*** Questions</h1><h2 id="0aeb">21) Storage conversion</h2><div id="f79f"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-string">'1TB'</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-string">'1 tb'</span>], [<span class="hljs-string">'C'</span>, <span class="hljs-string">'256 GB'</span>], [<span class="hljs-string">'D'</span>, <span class="hljs-string">'512MB'</span>], [<span class="hljs-string">'E'</span>, <span class="hljs-string">'512GB'</span>], ], columns=[<span class="hljs-string">'laptop'</span>, <span class="hljs-string">'storage'</span>])</pre></div><p id="d6ad">Let’s assume that:</p><ul><li>1 TB == 1000 GB</li><li>1 GB == 1000 MB</li></ul><p id="a888">Clean the ‘storage’ column, and convert the values to megabytes. Be careful of the inconsistent spaces/casing.</p><figure id="e4e2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*fLU_Or2xzJbxTtbxhp7RUA.png"><figcaption></figcaption></figure><h2 id="24bf">22) Feature extraction from dirty data</h2><div id="27c0"><pre>df = pd.DataFrame([ [<span class="hljs-number">1</span>, <span class="hljs-string">'name=rocky;age=4'</span>], [<span class="hljs-number">2</span>, <span class="hljs-string">'name=ricky;breed=dog'</span>], [<span class="hljs-number">3</span>, <span class="hljs-string">'breed=dog'</span>], [<span class="hljs-number">4</span>, <span class="hljs-string">'name=ducky;age=5;breed=duck'</span>], [<span class="hljs-number">5</span>, <span class="hljs-string">'age=6;breed=cat'</span>], ], columns=[<span class="hljs-string">'pet_id'</span>, <span class="hljs-string">'desc'</span>])</pre></div><p id="58

Options

2f">Here, the <code>desc</code> column contains multiple key-value pairs stored in the form <code>key1=value1;key2=value2</code></p><figure id="4eaf"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*7nc_BYvRpnVT-z9fUeGXVA.png"><figcaption></figcaption></figure><h2 id="0987">23) Grading scores</h2><div id="adaf"><pre>df = pd.DataFrame([ [<span class="hljs-number">60</span>, <span class="hljs-number">70</span>, <span class="hljs-number">72</span>, <span class="hljs-number">90</span>, <span class="hljs-number">74</span>], [<span class="hljs-number">76</span>, <span class="hljs-number">70</span>, <span class="hljs-number">80</span>, <span class="hljs-number">84</span>, <span class="hljs-number">62</span>], [<span class="hljs-number">92</span>, <span class="hljs-number">70</span>, <span class="hljs-number">64</span>, <span class="hljs-number">82</span>, <span class="hljs-number">94</span>], [<span class="hljs-number">88</span>, <span class="hljs-number">68</span>, <span class="hljs-number">98</span>, <span class="hljs-number">90</span>, <span class="hljs-number">100</span>], [<span class="hljs-number">86</span>, <span class="hljs-number">70</span>, <span class="hljs-number">78</span>, <span class="hljs-number">66</span>, <span class="hljs-number">96</span>], ], columns=[<span class="hljs-string">'english'</span>, <span class="hljs-string">'math'</span>, <span class="hljs-string">'science'</span>, <span class="hljs-string">'history'</span>, <span class="hljs-string">'geography'</span>])</pre></div><p id="2c86">Convert each score to a grade:</p><ul><li>91 to 100 → A</li><li>81 to 90 → B</li><li>71 to 80 → C</li><li>61 to 70 → D</li><li>60 and below → E</li></ul><figure id="e7b9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*veW926e9zj6s1ey4MPhh9w.png"><figcaption></figcaption></figure><h2 id="38b7">24) Ranking scores</h2><div id="93c1"><pre>df = pd.DataFrame([ [<span class="hljs-number">60</span>, <span class="hljs-number">70</span>, <span class="hljs-number">72</span>, <span class="hljs-number">90</span>, <span class="hljs-number">74</span>], [<span class="hljs-number">76</span>, <span class="hljs-number">71</span>, <span class="hljs-number">80</span>, <span class="hljs-number">84</span>, <span class="hljs-number">62</span>], [<span class="hljs-number">92</span>, <span class="hljs-number">72</span>, <span class="hljs-number">64</span>, <span class="hljs-number">82</span>, <span class="hljs-number">94</span>], [<span class="hljs-number">88</span>, <span class="hljs-number">68</span>, <span class="hljs-number">98</span>, <span class="hljs-number">90</span>, <span class="hljs-number">100</span>], [<span class="hljs-number">86</span>, <span class="hljs-number">73</span>, <span class="hljs-number">78</span>, <span class="hljs-number">66</span>, <span class="hljs-number">96</span>], ], columns=[<span class="hljs-string">'english'</span>, <span class="hljs-string">'math'</span>, <span class="hljs-string">'science'</span>, <span class="hljs-string">'history'</span>, <span class="hljs-string">'geography'</span>]</pre></div><p id="b878">For each subject, rank each student’s scores using numbers 1 to 5.</p><figure id="221c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*8OaZr-BbvyvVbjQBmpsrnQ.png"><figcaption></figcaption></figure><h2 id="8e66">25) Outlier Detection</h2><div id="65cb"><pre>df = pd.DataFrame([ [<span class="hljs-number">50</span>,<span class="hljs-number">60</span>,<span class="hljs-number">70</span>,<span class="hljs-number">80</span>], [<span class="hljs-number">62</span>,<span class="hljs-number">74</span>,<span class="hljs-number">50</span>,<span class="hljs-number">55</span>], [<span class="hljs-number">50</span>,<span class="hljs-number">64</span>,<span class="hljs-number">71</span>,<span class="hljs-number">81</span>], [<span class="hljs-number">50</span>,<span class="hljs-number">64</span>,<span class="hljs-number">72</span>,<span class="hljs-number">82</span>], [<span class="hljs-number">53</span>,<span class="hljs-number">65</span>,<span class="hljs-number">67</span>,<span class="hljs-number">79</span>], ], columns=[<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>])</pre></div><p id="5c88">Here, a value is considered an outlier as compared to the other values in its column if:</p><ul><li>It is more than (mean of column) + (standard deviation of column)</li><li>It is less than (mean of column) - (standard deviation of column)</li></ul><figure id="ff86"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9GlBOPN-Ghqx710TxpMCgg.png"><figcaption></figcaption></figure><h2 id="d172">26) Dirty data</h2><div id="8f2d"><pre>df = pd.DataFrame([ [<span class="hljs-string">'bob'</span>, <span class="hljs-string">'rocky,fifi,baaron'</span>], [<span class="hljs-string">'tim'</span>, <span class="hljs-string">'lucky,ricky'</span>], [<span class="hljs-string">'tom'</span>, <span class="hljs-string">'rex,lala'</span>] ], columns=[<span class="hljs-string">'owner'</span>, <span class="hljs-string">'dog'</span>])</pre></div><figure id="6335"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*HxPMhVQt2ugGufwgMbZ3kA.png"><figcaption></figcaption></figure><h2 id="e325">27) Average number per country per team</h2><div id="2f7b"><pre>df = pd.DataFrame([ [<span class="hljs-string">'SG'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-number">1000</span>], [<span class="hljs-string">'SG'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-number">1100</span>], [<span class="hljs-string">'SG'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-number">1200</span>], [<span class="hljs-string">'SG'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-number">1300</span>], [<span class="hljs-string">'MY'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-number">1400</span>], [<span class="hljs-string">'MY'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-number">1500</span>], [<span class="hljs-string">'MY'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-number">1600</span>], [<span class="hljs-string">'MY'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-number">1700</span>], ], columns=[<span class="hljs-string">'country'</span>, <span class="hljs-string">'team'</span>, <span class="hljs-string">'number'</span>])</pre></div><p id="97a8">Find the average <code>number</code> per country per team.</p><figure id="8bfe"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ICnRUZv8CdinUWxRu7LkEA.png"><figcaption></figcaption></figure><h2 id="953f">28) Expanding columns</h2><div id="60e1"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-number">100</span>, <span class="hljs-number">110</span>, <span class="hljs-number">120</span>, <span class="hljs-number">130</span>, <span class="hljs-number">140</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-number">101</span>, <span class="hljs-number">111</span>, <span class="hljs-number">121</span>, <span class="hljs-number">131</span>, <span class="hljs-number">141</span>], [<span class="hljs-string">'C'</span>, <span class="hljs-number">200</span>, <span class="hljs-number">210</span>, <span class="hljs-number">220</span>, <span class="hljs-number">230</span>, <span class="hljs-number">240</span>], [<span class="hljs-string">'D'</span>, <span class="hljs-number">303</span>, <span class="hljs-number">310</span>, <span class="hljs-number">320</span>, <span class="hljs-number">330</span>, <span class="hljs-number">340</span>], [<span class="hljs-string">'E'</span>, <span class="hljs-number">500</span>, <span class="hljs-number">510</span>, <span class="hljs-number">520</span>, <span class="hljs-number">530</span>, <span class="hljs-number">540</span>], ], columns=[<span class="hljs-string">'stock'</span>, <span class="hljs-string">'jan'</span>, <span class="hljs-string">'feb'</span>, <span class="hljs-string">'mar'</span>, <span class="hljs-string">'apr'</span>, <span class="hljs-string">'may'</span>])</pre></div><p id="cef0">^ This dataframe above might not be very friendly for analysis. Write some code to convert it into the new dataframe below.</p><figure id="12da"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*EbjWsaj3kjzl-irJLK5NiA.png"><figcaption></figcaption></figure><h2 id="67fd">29) Largest per month</h2><div id="8f94"><pre>df = pd.DataFrame([ [<span class="hljs-string">'A'</span>, <span class="hljs-number">100</span>, <span class="hljs-number">110</span>, <span class="hljs-number">120</span>, <span class="hljs-number">130</span>, <span class="hljs-number">140</span>], [<span class="hljs-string">'B'</span>, <span class="hljs-number">110</span>, <span class="hljs-number">101</span>, <span class="hljs-number">100</span>, <span class="hljs-number">135</span>, <span class="hljs-number">150</span>], [<span class="hljs-string">'C'</span>, <span class="hljs-number">100</span>, <span class="hljs-number">140</span>, <span class="hljs-number">100</span>, <span class="hljs-number">135</span>, <span class="hljs-number">60</span>], [<span class="hljs-string">'D'</span>, <span class="hljs-number">120</span>, <span class="hljs-number">130</span>, <span class="hljs-number">90</span>, <span class="hljs-number">121</span>, <span class="hljs-number">70</span>], [<span class="hljs-string">'E'</span>, <span class="hljs-number">100</span>, <span class="hljs-number">110</span>, <span class="hljs-number">20</span>, <span class="hljs-number">30</span>, <span class="hljs-number">40</span>], ], columns=[<span class="hljs-string">'stock'</span>, <span class="hljs-string">'jan'</span>, <span class="hljs-string">'feb'</span>, <span class="hljs-string">'mar'</span>, <span class="hljs-string">'apr'</span>, <span class="hljs-string">'may'</span>])</pre></div><p id="c384">Write some code to find the largest value per month out of all 5 stocks, and display them in the format below.</p><figure id="ca9f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*pYKeu9V0CBmmYjhxZUykVQ.png"><figcaption></figcaption></figure><h2 id="5dfd">30) Dealing with screwed up data</h2><div id="b735"><pre>df = pd.DataFrame([ [<span class="hljs-string">'name=rocky,age=4'</span>, <span class="hljs-string">''</span>, <span class="hljs-string">'breed=dog,gender=male'</span>], [<span class="hljs-string">'age=5'</span>, <span class="hljs-string">'name=fifi,breed=dog'</span>, <span class="hljs-string">'gender=female'</span>], [<span class="hljs-string">'breed=cat'</span>, <span class="hljs-string">'name=ricky'</span>, <span class="hljs-string">'age=6'</span>], [<span class="hljs-string">'age=7,name=lucky,breed=bird'</span>, <span class="hljs-string">''</span>, <span class="hljs-string">'gender=male'</span>], [<span class="hljs-string">''</span>, <span class="hljs-string">'age=8'</span>, <span class="hljs-string">'name=bucky,breed=chicken,gender=female'</span>] ], columns=[<span class="hljs-string">'desc1'</span>, <span class="hljs-string">'desc2'</span>, <span class="hljs-string">'desc3'</span>])</pre></div><p id="a5b7">Clean the above (extremely messy) dataframe to get the dataframe below. Missing values should be converted to NaN.</p><figure id="2bd6"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*-s93qmKt9A7I9YfqbdcF6w.png"><figcaption></figcaption></figure><h1 id="02ae">Conclusion</h1><p id="67d9">Hope these questions were helpful in your practice!</p><h1 id="40ec">Some Final words</h1><p id="589a"><i>If this story provided value to you, and you wish to show support, you could:</i></p><ol><li><i>Clap multiple times for this story (this really helps me out!)</i></li><li><i>Consider signing up for a Medium membership using my link — it’s $5 per month and you get to read unlimited stories on Medium.</i></li></ol><p id="c59f"><a href="https://zlliu.medium.com/membership"><b><i>Sign up using my link here to read unlimited Medium articles.</i></b></a></p><p id="9d4b"><b>Get my free Ebooks: <a href="https://zlliu.co/books">https://zlliu.co/books</a></b></p><p id="9bc1"><i>I write Python articles (sometimes other stuff) that the younger me would have wanted to read. Do join my email list to get notified whenever I publish.</i></p><div id="d77a" class="link-block"> <a href="https://zlliu.medium.com/subscribe"> <div> <div> <h2>Get an email whenever Liu Zuo Lin publishes.</h2> <div><h3>Get an email whenever Liu Zuo Lin publishes. By signing up, you will create a Medium account if you don't already have…</h3></div> <div><p>zlliu.medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*IgpLf7o7RdjrLUuK)"></div> </div> </div> </a> </div><p id="4174"><i>More content at <a href="https://plainenglish.io/"><b>PlainEnglish.io</b></a>. Sign up for our <a href="http://newsletter.plainenglish.io/"><b>free weekly newsletter</b></a>. Follow us on <a href="https://twitter.com/inPlainEngHQ"><b>Twitter</b></a></i>, <a href="https://www.linkedin.com/company/inplainenglish/"><b><i>LinkedIn</i></b></a><i>, <a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><b>YouTube</b></a>, and <a href="https://discord.gg/GtDtUAvyhW"><b>Discord</b></a>. Interested in Growth Hacking? Check out <a href="https://circuit.ooo/"><b>Circuit</b></a>.</i></p></article></body>

30 Pandas Practice Questions

# 10 Beginner + 10 Intermediate + 10 Tricky Questions

cool art

* Questions

1) Getting the first 3 rows and first 4 columns

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 70, 80, 84, 62],
    [92, 70, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 70, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography'])

2) Getting all rows for science, history and english

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 70, 80, 84, 62],
    [92, 70, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 70, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography'])

3) Renaming headers/columns

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 70, 80, 84, 62],
    [92, 70, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 70, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography'])

Renaming the existing columns using the following dictionary:

{'english':'eng', 'science':'sci', 'history':'his', 'geography':'geo'}

4) Filling NaN values with 0

df = pd.DataFrame([
    [60, None, 72, 90, 74],
    [76, 70, 80, None, 62],
    [92, 70, 64, 82, 94],
    [None, 68, 98, 90, 100],
    [86, 70, 78, 66, None],
], columns=['english', 'math', 'science', 'history', 'geography'])

Replace all NaN values with 0.0

5) Finding Mean Scores Per Student

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 70, 80, 84, 62],
    [92, 70, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 70, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography']) 

Create a new column named avg representing the average score of each student.

6) Creating a new ‘worth’ column

df = pd.DataFrame([
    ['A', 'apple', 1.5, 20],
    ['A', 'orange', 2.0, 30],
    ['A', 'pear', 2.5, 10],
    ['B', 'apple', 3.0, 8],
    ['B', 'orange', 3.5, 20],
    ['B', 'pear', 4.0, 10],
], columns=['shop', 'fruit', 'price', 'quantity'])

Where worth = price * quantity

7) Filtering prices less than $3

df = pd.DataFrame([
    ['A', 'apple', 1.5, 20],
    ['A', 'orange', 2.0, 30],
    ['A', 'pear', 2.5, 10],
    ['B', 'apple', 3.0, 8],
    ['B', 'orange', 3.5, 20],
    ['B', 'pear', 4.0, 10],
], columns=['shop', 'fruit', 'price', 'quantity'])

Keep only rows where price is less than $3.

8) String cleaning

df = pd.DataFrame([
    ['Apple ', 'a'],
    ['   oRanGe ', 'B'],
    ['pear    ', '    c'],
], columns=['fruit', 'shop'])

For each string, removing all whitespaces at the beginning/end of the string, and convert the entire string into lowercase.

9) Keeping only apples

df = pd.DataFrame([
    ['apple', 4],
    ['apple', 5],
    ['orange', 6],
    ['apple', 7],
    ['orange', 8],
    ['apple', 9],
    ['pear', 10],
], columns=['fruit', 'price'])

10) Math Olympiad

df = pd.DataFrame([
    [60, 70, 82, 90, 74],
    [76, 81, 81, 84, 62],
    [92, 92, 84, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 93, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography'])

A student is eligible for math olympiad if both math and science scores are above 80. Create a new column olympiad containing boolean values. True means the student is eligible for the math olympiad, and False means that the student is not eligible.

** Questions

11) Finding the mean + median per subject

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 70, 80, 84, 62],
    [92, 70, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 70, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography'])

12) Filling NaN with median of column

df = pd.DataFrame([
    [60, None, 72, 90, 74],
    [76, 70, 80, None, 62],
    [92, 70, 64, 82, 94],
    [None, 68, 98, 90, 100],
    [86, 70, 78, 66, None],
], columns=['english', 'math', 'science', 'history', 'geography'])

13) Average (mean) price per shop

df = pd.DataFrame([
    ['A', 'apple', 1.5, 20],
    ['A', 'orange', 2.0, 30],
    ['A', 'pear', 2.5, 10],
    ['B', 'apple', 3.0, 8],
    ['B', 'orange', 3.5, 20],
    ['B', 'pear', 4.0, 10],
], columns=['shop', 'fruit', 'price', 'quantity'])

14) Average (median) price per fruit

df = pd.DataFrame([
    ['A', 'apple', 1.5, 20],
    ['A', 'orange', 2.0, 30],
    ['A', 'pear', 2.5, 10],
    ['B', 'apple', 3.0, 8],
    ['B', 'orange', 3.5, 20],
    ['B', 'pear', 4.0, 10],
], columns=['shop', 'fruit', 'price', 'quantity'])

15) Area, Circumference, Volume and Surface Area

df = pd.DataFrame([
    [1],
    [1.5],
    [2],
    [10]
], columns=['radius'])

Where radius represents the radius of a circle/sphere, and:

  • pi = 3.14159
  • area = pi * radius²
  • circumference = 2 * pi * radius
  • volume = 4 / 3 * pi * radius³
  • suface area = 4 * pi * radius²

16) Email feature extraction

df = pd.DataFrame([
    ['[email protected]'],
    ['[email protected]'],
    ['[email protected]'],
    ['[email protected]'],
], columns=['email'])

Write some code to extract the following information

  • name — the stuff before the @
  • host — the stuff after the @
  • tld — the stuff after the last .

17) Moving datetime column to index

df = pd.DataFrame([
    ['2022-01-03', 100],
    ['2022-01-02', 110],
    ['2022-01-01', 120],
    ['2022-01-08', 130],
    ['2022-01-05', 120],
    ['2022-01-06', 140],
    ['2022-01-07', 150],
    ['2022-01-04', 120],
], columns=['date', 'price'])

Write some code to first convert the datetime column to actual datetime objects, then move it to the index of the dataframe. Remember to sort the values by date.

18) Mean value per month

df = pd.DataFrame([
    ['2022-01-03', 100],
    ['2022-01-02', 110],
    ['2022-01-01', 120],
    ['2022-02-08', 130],
    ['2022-02-05', 120],
    ['2022-05-06', 140],
    ['2022-05-07', 150],
    ['2022-05-04', 120],
], columns=['date', 'value'])

Here, the dates are in yyyy-mm-dd format. Find the average value per month.

19) Cleaning dirty numbers

df = pd.DataFrame([
    ['A', '100,000'],
    ['B', '80,000'],
    ['C', '20,200'],
    ['D', '50,000'],
    ['E', '10,000'],
], columns=['item', 'cost'])

Here, the numbers have commas in them, causing pandas to think that they are strings. Convert each number string to an actual number.

20) Fixing duplicated entries

df = pd.DataFrame([
    ['A', 20],
    ['B', 25],
    ['C', 40],
    ['A', 22],
    ['B', 1],
    ['A', 1],
], columns=['item', 'quantity'])

Here, multiple items and their quantities are duplicated instead of combined. Write some code to combine them instead.

*** Questions

21) Storage conversion

df = pd.DataFrame([
    ['A', '1TB'],
    ['B', '1 tb'],
    ['C', '256 GB'],
    ['D', '512MB'],
    ['E', '512GB'],
], columns=['laptop', 'storage'])

Let’s assume that:

  • 1 TB == 1000 GB
  • 1 GB == 1000 MB

Clean the ‘storage’ column, and convert the values to megabytes. Be careful of the inconsistent spaces/casing.

22) Feature extraction from dirty data

df = pd.DataFrame([
    [1, 'name=rocky;age=4'],
    [2, 'name=ricky;breed=dog'],
    [3, 'breed=dog'],
    [4, 'name=ducky;age=5;breed=duck'],
    [5, 'age=6;breed=cat'],
], columns=['pet_id', 'desc'])

Here, the desc column contains multiple key-value pairs stored in the form key1=value1;key2=value2

23) Grading scores

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 70, 80, 84, 62],
    [92, 70, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 70, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography'])

Convert each score to a grade:

  • 91 to 100 → A
  • 81 to 90 → B
  • 71 to 80 → C
  • 61 to 70 → D
  • 60 and below → E

24) Ranking scores

df = pd.DataFrame([
    [60, 70, 72, 90, 74],
    [76, 71, 80, 84, 62],
    [92, 72, 64, 82, 94],
    [88, 68, 98, 90, 100],
    [86, 73, 78, 66, 96],
], columns=['english', 'math', 'science', 'history', 'geography']

For each subject, rank each student’s scores using numbers 1 to 5.

25) Outlier Detection

df = pd.DataFrame([
    [50,60,70,80],
    [62,74,50,55],
    [50,64,71,81],
    [50,64,72,82],
    [53,65,67,79],
], columns=['A', 'B', 'C', 'D'])

Here, a value is considered an outlier as compared to the other values in its column if:

  • It is more than (mean of column) + (standard deviation of column)
  • It is less than (mean of column) - (standard deviation of column)

26) Dirty data

df = pd.DataFrame([
    ['bob', 'rocky,fifi,baaron'],
    ['tim', 'lucky,ricky'],
    ['tom', 'rex,lala']
], columns=['owner', 'dog'])

27) Average number per country per team

df = pd.DataFrame([
    ['SG', 'A', 1000],
    ['SG', 'A', 1100],
    ['SG', 'B', 1200],
    ['SG', 'B', 1300],
    ['MY', 'C', 1400],
    ['MY', 'C', 1500],
    ['MY', 'D', 1600],
    ['MY', 'D', 1700],
], columns=['country', 'team', 'number'])

Find the average number per country per team.

28) Expanding columns

df = pd.DataFrame([
    ['A', 100, 110, 120, 130, 140],
    ['B', 101, 111, 121, 131, 141],
    ['C', 200, 210, 220, 230, 240],
    ['D', 303, 310, 320, 330, 340],
    ['E', 500, 510, 520, 530, 540],
], columns=['stock', 'jan', 'feb', 'mar', 'apr', 'may'])

^ This dataframe above might not be very friendly for analysis. Write some code to convert it into the new dataframe below.

29) Largest per month

df = pd.DataFrame([
    ['A', 100, 110, 120, 130, 140],
    ['B', 110, 101, 100, 135, 150],
    ['C', 100, 140, 100, 135, 60],
    ['D', 120, 130, 90, 121, 70],
    ['E', 100, 110, 20, 30, 40],
], columns=['stock', 'jan', 'feb', 'mar', 'apr', 'may'])

Write some code to find the largest value per month out of all 5 stocks, and display them in the format below.

30) Dealing with screwed up data

df = pd.DataFrame([
    ['name=rocky,age=4', '', 'breed=dog,gender=male'],
    ['age=5', 'name=fifi,breed=dog', 'gender=female'],
    ['breed=cat', 'name=ricky', 'age=6'],
    ['age=7,name=lucky,breed=bird', '', 'gender=male'],
    ['', 'age=8', 'name=bucky,breed=chicken,gender=female']
], columns=['desc1', 'desc2', 'desc3'])

Clean the above (extremely messy) dataframe to get the dataframe below. Missing values should be converted to NaN.

Conclusion

Hope these questions were helpful in your practice!

Some Final words

If this story provided value to you, and you wish to show support, you could:

  1. Clap multiple times for this story (this really helps me out!)
  2. Consider signing up for a Medium membership using my link — it’s $5 per month and you get to read unlimited stories on Medium.

Sign up using my link here to read unlimited Medium articles.

Get my free Ebooks: https://zlliu.co/books

I write Python articles (sometimes other stuff) that the younger me would have wanted to read. Do join my email list to get notified whenever I publish.

More content at PlainEnglish.io. Sign up for our free weekly newsletter. Follow us on Twitter, LinkedIn, YouTube, and Discord. Interested in Growth Hacking? Check out Circuit.

Python
Python Programming
Pandas
Recommended from ReadMedium