itle class_">Color</span>:
    <span class="hljs-keyword">def</span> <span class="hljs-title function_">__init__</span>(<span class="hljs-params"><span class="hljs-variable language_">self</span>, r, g, b</span>):
        <span class="hljs-variable language_">self</span>.r, <span class="hljs-variable language_">self</span>.g, <span class="hljs-variable language_">self</span>.b = r, g, b</pre></div><div id="f6e6"><pre><span class="hljs-comment"># or this</span>
<span class="hljs-attribute">RED</span>, GREEN, BLUE = <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>
<span class="hljs-attribute">c</span> = (<span class="hljs-number">0</span>, <span class="hljs-number">255</span>, <span class="hljs-number">120</span>) # lime green
<span class="hljs-attribute">red_value</span> = c[RED]</pre></div><div id="0508"><pre><span class="hljs-comment"># do this</span>
<span class="hljs-keyword">from</span> collections import namedtuple
Color = namedtuple(<span class="hljs-string">'Color'</span>, [<span class="hljs-string">'r'</span>, <span class="hljs-string">'g'</span>, <span class="hljs-string">'b'</span>])
c = Color(0, <span class="hljs-attribute">g</span>=255, <span class="hljs-attribute">b</span>=120) # lime green
red_value = c.r</pre></div><p id="5530">Example 4: <code>bisect.insort</code> for maintaining a sorted list</p><div id="5d08"><pre><span class="hljs-comment"># just do this</span>
<span class="hljs-attribute">from</span> bisect import insort
<span class="hljs-attribute">my_sorted_list</span> =<span class="hljs-meta"> []</span>
<span class="hljs-attribute">for</span> i in (<span class="hljs-number">1</span>, <span class="hljs-number">324</span>, <span class="hljs-number">52</span>, <span class="hljs-number">568</span>, <span class="hljs-number">24</span>, <span class="hljs-number">12</span>, <span class="hljs-number">8</span>):
    <span class="hljs-attribute">insort</span>(my_sorted_list, i)</pre></div><p id="7570">Another common pitfall is reaching for a data structure that is overkill for the job. For example, if you never change the contents of a list once it is defined, a tuple uses less memory; and if you mostly add or remove items at either end of a list, go for a deque.</p><p id="705d">More on this topic:</p><div id="1685" class="link-block">
<a href="https://docs.python.org/3/library/">
<div>
<div>
<h2>The Python Standard Library - Python 3.9.0 documentation</h2>
<div><h3>While The Python Language Reference describes the exact syntax and semantics of the Python language, this library…</h3></div>
<div><p>docs.python.org</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/)"></div>
</div>
</div>
</a>
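<p>To make the tuple and deque suggestions in this section concrete, here is a minimal sketch. The variable names are made up for illustration, and the exact byte counts depend on your Python build, so treat the sizes as indicative only.</p>

```python
import sys
from collections import deque

# A tuple holding the same items as a list has a smaller footprint in CPython.
weekdays_list = ["Mon", "Tue", "Wed", "Thu", "Fri"]
weekdays_tuple = tuple(weekdays_list)
list_bytes = sys.getsizeof(weekdays_list)
tuple_bytes = sys.getsizeof(weekdays_tuple)
print(f"list: {list_bytes} bytes, tuple: {tuple_bytes} bytes")

# A deque adds and removes items at either end in constant time,
# whereas list.pop(0) and list.insert(0, x) shift every other element.
queue = deque([1, 2, 3])
queue.appendleft(0)      # the deque now starts with 0
first = queue.popleft()  # removes the item at the left end
last = queue.pop()       # removes the item at the right end
```

<p>The saving per tuple is small, so it matters mostly when you hold a very large number of them.</p>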
</div><h2 id="fa2e">4. Streamline your code when packaging Notebooks</h2><p id="f09b">We all know it: development notebooks are some of the worst scripts in the world when it comes to readability, scalability and maintainability. There are a few rules of thumb for packaging your notebooks into a Python package:</p><ul><li><b>Identify repeating code snippets and group them into a function</b>: For example, instead of writing a regular expression for keyword matching every time, I have this snippet:</li></ul>
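<p>As a rough illustration of this first rule (this is not the author's gist), a keyword-matching helper might look something like the sketch below. The function name <code>build_keyword_pattern</code>, its parameters and the sample keywords are all hypothetical.</p>

```python
import re


def build_keyword_pattern(keywords, ignore_case=True):
    """Compile one regex that matches any of the given keywords as whole words."""
    # Put longer keywords first so "machine learning" wins over "machine".
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags)


# Build the pattern once, then reuse it wherever keyword matching is needed.
pattern = build_keyword_pattern(["machine learning", "machine", "NLP"])
matches = pattern.findall("NLP and Machine Learning run on a machine.")
print(matches)
```

<p>Keeping the escaping and word-boundary logic in one place means a fix to the pattern applies everywhere it is used.</p>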
<figure id="96bf">
<div>
<div>
<iframe class="gist-iframe" src="/gist/wululoo/52c71a836866096a02a9b9b0feb0b777.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><ul><li><b>Remove redundant cells, lines and variables</b>: Yes, we are talking about <code>df_temp = df.copy()</code>, <code>gc.collect()</code>, <code>temp = df.some_column.unique()</code></li><li><b>Move all your functions and imports to the beginning of the notebook: </b>A good lesson that C taught me at uni was to define everything you need upfront, so you never need to worry about where to find it</li><li><b>Untangle and sort cells by sequence of execution: </b>The last thing you want is to have to remember when you jumped from the last cell of the notebook back to the very first</li></ul><h2 id="0215">5. Inherit existing Abstract Classes whenever possible</h2><p id="1fc3">Object-oriented programming vs functional programming will forever be a debate when it comes to organising your code. But the bottom line is that using either one is better than using neither. When it comes to data science, a number of packages are organised in an OOP manner, e.g. scikit-learn, nltk, spaCy, and more. If you manage to package your script into classes that extend their abstract classes, it will help organise your code better and allow better integration with your data science pipeline.</p><p id="3f52">For example, based on my previous regular expression snippet, I would like to construct a tokeniser that returns only the tokens captured by the function. By inheriting nltk’s <code>TokenizerI</code>, my <code>RegexpEntityTokenizer</code> now has the same interface as, say, nltk’s <code>PunktSentenceTokenizer</code>. This means that I can now use my custom tokenizer wherever an nltk tokenizer is expected, for example as the tokenizer of a vectoriser inside a sklearn <code>Pipeline</code>, for a seamless workflow.</p>
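<p>For readers who cannot see the embedded gist, the idea of extending nltk's <code>TokenizerI</code> can be sketched roughly as follows. This is not the author's <code>RegexpEntityTokenizer</code>: the class name <code>KeywordTokenizer</code> and the sample keywords are hypothetical, and the <code>except ImportError</code> branch is only a stand-in so that the sketch still runs when nltk is not installed.</p>

```python
import re
from abc import ABC, abstractmethod

try:
    from nltk.tokenize.api import TokenizerI
except ImportError:
    # Stand-in with the same required method, so the sketch runs without nltk.
    class TokenizerI(ABC):
        @abstractmethod
        def tokenize(self, s):
            """Return a list of tokens found in the string s."""


class KeywordTokenizer(TokenizerI):
    """Hypothetical tokenizer that keeps only the listed keywords."""

    def __init__(self, keywords):
        # Longer keywords first, so overlapping keywords match greedily.
        ordered = sorted(keywords, key=len, reverse=True)
        alternatives = "|".join(re.escape(k) for k in ordered)
        self._pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

    def tokenize(self, s):
        # tokenize() is the one method TokenizerI requires subclasses to implement.
        return self._pattern.findall(s)


tokenizer = KeywordTokenizer(["pandas", "numpy"])
tokens = tokenizer.tokenize("NumPy arrays sit underneath pandas DataFrames.")
print(tokens)
```

<p>With nltk installed, the subclass also inherits helpers such as <code>tokenize_sents</code> from <code>TokenizerI</code> without any extra code, which is the main payoff of extending the library's base class instead of writing a free-standing function.</p>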
<figure id="9222">
<div>
<div>
<iframe class="gist-iframe" src="/gist/wululoo/50953851e86debd69ece8d4ca4f4896f.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="feed">More on this topic here:</p><div id="8708" class="link-block">
<a href="https://towardsdatascience.com/object-oriented-programming-for-data-scientists-build-your-ml-estimator-7da416751f64">
<div>
<div>
<h2>Object-oriented programming for data scientists: Build your ML estimator</h2>
<div><h3>Implement some of the core OOP principles in a machine learning context by building your own Scikit-learn-like…</h3></div>
<div><p>towardsdatascience.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*M_XOnlAT7CEkkhedW7kZEg.jpeg)"></div>
</div>
</div>
</a>
</div><h2 id="ae33">6. Stop pickling everything</h2><p id="5819">Yes, I know. Pickling DataFrames and other objects is convenient; you can serialise basically everything in your Python script into a <code>.pkl</code> file. However, pickle’s convenience comes with several weak links:</p><ul><li><b>Not designed for speed</b>: Pickle is designed to handle any object, which makes it slower than more specialised serialisation formats</li><li><b>Not secure</b>: Unpickling a file can execute arbitrary code that has been hidden in it. Imagine if your ex has some birthday pranks prepared for you.</li><li><b>Not portable</b>: Not all pickled data is compatible between different versions of Python.</li></ul><p id="c87b">So remember, pickle is not the only way to save your data and progress. Below are some of the alternatives:</p><ol><li><b>cPickle</b>: If you are still on Python 2, look into cPickle, which is pickle implemented in C; the Python 2 documentation describes it as up to 1,000 times faster than the pure-Python pickle. On Python 3 there is nothing to switch to: the standard <code>pickle</code> module already uses its C implementation when it is available.</li><li><b>JSON</b>: JSON is a lightweight text format for simple data such as dictionaries, lists, strings and numbers, and loading it does not execute code. On top of that, JSON makes it possible to store and edit dictionaries (or kwargs for configuring your models) outside Python in a text editor.</li><li><b>NumPy</b>: If you are serialising a well-defined structure that fits in a <code>numpy.ndarray</code>, NumPy’s <code>np.save</code> and <code>np.memmap</code> are some faster options</li><li><b>Joblib</b>: This should be no stranger to anyone who has been developing machine learning models. If your object contains large <code>np.ndarray</code>s, then joblib may be suitable for you. Its interface is basically the same as pickle’s, only a bit simpler, so you should have no problem using it. However, do note that joblib comes with the same security concern as pickle.</li><li><b>h5py</b>: No stranger to those who have played with, say, tensorflow or keras before. HDF5 files can hold large amounts of compressed numerical data. It is true that a hierarchical data structure with compressed data means a bit of a learning curve, and makes your datasets a bit harder to query; but on the flip side, you will get very efficient data I/O from it.</li></ol><h2 id="6159">7. 
Consider using a Database</h2><p id="a721">Last but never least, this one is a big jump and could change your game plan. But trust me, having been through the conversion myself, it is worth it. Check whether you have experienced any of the following:</p><ul><li>You cannot speed up your Python script any further on your huge pandas DataFrame</li><li>Saving and loading your dataset is taking longer and longer</li><li>All of your data sits in the same place, whether it’s raw, external, or processed</li><li>New data keeps coming in and is saved as CSVs</li></ul><p id="2117">These are all signs that a database will make your life a lot easier. And once you have a database in place, SQL can do a lot of the heavy lifting. A lot of the time, I/O is the bottleneck for my Python scripts. By moving the processing one step closer to the data, read-write speed improved by an order of magnitude. Best of all, whatever your requirements are, you should be able to find a suitable database: MySQL, PostgreSQL, MongoDB, CouchDB, etc.</p><p id="8722">More on this topic here:</p><div id="5e81" class="link-block">
<a href="https://towardsdatascience.com/databases-101-introduction-to-databases-for-data-scientists-ee18c9f0785d">
<div>
<div>
<h2>Databases 101: Introduction to Databases for Data Scientists</h2>
<div><h3>How to get started with the world of databases?</h3></div>
<div><p>towardsdatascience.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*YL3oblyvva_2Uj3z0zhydA.png)"></div>
</div>
</div>
</a>
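<p>As a small illustration of moving the processing closer to the data, the sketch below lets SQL do an aggregation so that only the summary reaches Python. It uses the standard library's <code>sqlite3</code> with an in-memory database purely to stay self-contained; the <code>sales</code> table and its rows are made up, and none of this is specific to the databases named above.</p>

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real project
# would connect to a file or a database server instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 45.5)],
)
conn.commit()

# Aggregate inside the database, so only the summary rows reach Python.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
print(totals)
```

<p>The same query would otherwise mean loading every row into a DataFrame first and grouping it in memory.</p>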
</div><h2 id="af7e">Before you go</h2><p id="aa9b">These are some of my Medium posts on data science and Python that you may want to check out:</p><div id="fe23" class="link-block">
<a href="https://towardsdatascience.com/efficient-implementation-of-conditional-logic-on-pandas-dataframes-4afa61eb7fce">
<div>
<div>
<h2>Efficient Conditional Logic on Pandas DataFrames</h2>
<div><h3>Time to stop being too dependent on .iterrows() and .apply()</h3></div>
<div><p>towardsdatascience.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*7SOe2uC6xIast1ON)"></div>
</div>
</div>
</a>
</div><div id="a54c" class="link-block">
<a href="https://towardsdatascience.com/cookiecutter-plugin-for-jupyter-easily-organise-your-data-science-environment-a56f83140f72">
<div>
<div>
<h2>Essential Jupyter Extension for Data Science Set Up</h2>
<div><h3>A custom Jupyter extension that helps organise your project folders</h3></div>
<div><p>towardsdatascience.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*Q7T8lq8ibhUAeArb)"></div>
</div>
</div>
</a>
</div><div id="020f" class="link-block">
<a href="https://towardsdatascience.com/mastering-root-searching-algorithms-in-python-7120c335a2a8">
<div>
<div>
<h2>Efficient Root Searching Algorithms in Python</h2>
<div><h3>Implementing efficient searching algorithms for finding roots, and optimisation in Python</h3></div>
<div><p>towardsdatascience.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*HMgA8rFCUIxc6JPe)"></div>
</div>
</div>
</a>
</div><h2 id="98a9">That’s it</h2><p id="d6a9">Hopefully, you have found these tips useful for polishing your data science endeavours. Let me know in the comments how you find these tricks.</p><p id="ab6f">Adios!</p><div id="4ca7" class="link-block">
<a href="https://www.linkedin.com/in/louis-chan-b55b9287/">
<div>
<div>
<h2>Louis Chan - Director, Data Science - FTI Consulting | LinkedIn</h2>
<div><h3>Ambitious, curious and creative individual with a strong belief in inter-connectivity between branches knowledge and a…</h3></div>
<div><p>www.linkedin.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/)"></div>
</div>
</div>
</a>
</div></article></body>