
7 Easy Ways for Improving Your Data Science Workflow

Tips I have learnt from working as a Data Scientist

It is not easy to ace your data science game, but there are always ways to improve how we practise data science. In this blog, we are going to talk about 7 easy ways that can quickly improve your data science workflow. These are tips that I learnt the hard way after leaving Uni. Hopefully, they can help you organise your environment, code better, and find a more suitable stack for your data science projects!

1. Organise your project directory

Nothing is worse than a messy folder with Untitled.ipynb up to Untitled9999.ipynb, CSV files scattered everywhere, and a bunch of cache folders like .ipynb_checkpoints. Setting up your project folder properly is a good first step that makes your daily workflow immensely smoother. Some common tools that help set up an organised folder are:

Cookiecutter: A Python package that helps you set up a standard folder directory. Check here for a cookiecutter-inspired plugin that is directly accessible from your Jupyter development environment.
Miniconda: A barebones version of Anaconda. If you do not need the GUI or the other applications that come with the Anaconda launcher, Miniconda is the way to go.
requirements.txt or environment.yml: These files let you recreate the development environment in no time.
.gitignore file: This keeps your git repository clean and uncluttered by ignoring files that do not need to be pushed to your repo, e.g. temporary files, notebook checkpoints, the venv folder, etc.
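If you want a feel for what these tools set up for you, here is a minimal sketch using only the standard library. The folder names and .gitignore entries are my own assumptions, loosely modelled on common data science templates, and are not what Cookiecutter itself generates.

```python
import tempfile
from pathlib import Path

# Hypothetical layout and ignore rules, chosen for illustration only.
FOLDERS = ["data/raw", "data/processed", "notebooks", "src", "models", "reports"]
GITIGNORE_LINES = [".ipynb_checkpoints/", "__pycache__/", "venv/", "*.tmp"]


def create_project_skeleton(root):
    """Create the folders above under `root` and write a starter .gitignore."""
    root = Path(root)
    for folder in FOLDERS:
        # parents=True also creates `root` itself if it does not exist yet
        (root / folder).mkdir(parents=True, exist_ok=True)
    (root / ".gitignore").write_text("\n".join(GITIGNORE_LINES) + "\n")
    return root


# Try it out in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project_skeleton(Path(tmp) / "my_project")
    created = sorted(p.name for p in project.iterdir())
    print(created)
```

In practice a template tool gives you more than this (README, setup files and so on), but the idea is the same: decide the layout once, then reuse it for every project.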

More on this topic here:

Essential Jupyter Extension for Data Science Set Up

A custom Jupyter extension that helps organise your project folders

towardsdatascience.com

2. Clean up your Python Environment

Most data science blog posts on Medium will tell you to create a virtual environment for your project, but they rarely tell you how important it is to keep it clean. Often when you create your development environment, a bunch of IPython-related packages get installed that are not required in production. To make your workflow smoother, you can use any of the following:

Multiple requirements.txt or environment.yml files: Having separate configurations for dev and prod may be useful, but you need to update both whenever you add new packages.
pipreqs: A Python package that generates a clean production requirements.txt containing only the packages that your code imports. Run pipreqs /path/to/your/project/folder.
pipdeptree: A Python package that helps you understand the dependencies in your Python environment. Run pipdeptree in the environment where you installed your requirements.txt.
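To illustrate the idea behind keeping only the packages your code imports, here is a rough sketch that scans source code for import statements with the standard ast module. This is not how pipreqs is implemented; find_imported_packages is a hypothetical helper, and it makes no attempt to filter out standard library modules or to map import names to PyPI names (for example sklearn vs scikit-learn).

```python
import ast


def find_imported_packages(source):
    """Return the top-level package names imported by a piece of Python source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import numpy.linalg" -> "numpy"
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from sklearn.pipeline import Pipeline" -> "sklearn"
            # Relative imports (level > 0) point at your own code, so skip them.
            packages.add(node.module.split(".")[0])
    return packages


example_source = """
import pandas as pd
import numpy.linalg
from sklearn.pipeline import Pipeline
from . import local_helpers
"""

print(sorted(find_imported_packages(example_source)))
```

Anything installed in your environment that never shows up in a scan like this is a candidate for removal from the production requirements.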

More on this topic here:

Better Python dependency while packaging your project

I have been cooking this blog topic idea for a long time. I did a lot of searching, reading and trying while working on…

medium.com

3. Make better use of Python’s Standard Libraries

Do not reinvent the wheel! More often than not, when you feel you are about to write some clunky code, you are recreating something that Python already provides.

Example 1: collections.Counter for counting occurrences

# instead of this
counter = dict()
for item in my_list:
    counter[item] = counter.get(item, 0) + 1

# do this
from collections import Counter
counter = Counter(my_list)

Example 2: itertools.chain.from_iterable for joining list of lists

# instead of this
joined_list = []
for sub_list in my_list_of_list:
    joined_list.extend(sub_list)

# do this
from itertools import chain
joined_list = list(chain.from_iterable(my_list_of_list))

Example 3: collections.namedtuple as a lightweight, immutable replacement for a simple class

# instead of this
class Color:
    def __init__(self, r, g, b):
        self.r, self.g, self.b = r, g, b

# or this
RED, GREEN, BLUE = 0, 1, 2
c = (0, 255, 120) # lime green
red_value = c[RED]

# do this
from collections import namedtuple
Color = namedtuple('Color', ['r', 'g', 'b'])
c = Color(0, g=255, b=120) # lime green
red_value = c.r

Example 4: bisect.insort for maintaining a sorted list

# just do this
from bisect import insort
my_sorted_list = []
for i in (1, 324, 52, 568, 24, 12, 8):
    insort(my_sorted_list, i)

Another common pitfall is using data structures that are overkill for the job. For example, if you never change the content of a list once it is defined, a tuple would save you some memory; or if you only add or remove elements at the first or last position, then go for a deque.
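A quick sketch of both points, assuming CPython (the exact byte counts vary between versions, so I only compare them rather than quote numbers):

```python
import sys
from collections import deque

numbers_list = list(range(1000))
numbers_tuple = tuple(numbers_list)

# A tuple carries less per-object overhead than a list of the same length.
list_size = sys.getsizeof(numbers_list)
tuple_size = sys.getsizeof(numbers_tuple)
print(f"list: {list_size} bytes, tuple: {tuple_size} bytes")

# A deque appends and pops at both ends in constant time,
# whereas list.pop(0) has to shift every remaining element.
queue = deque([1, 2, 3])
queue.appendleft(0)
first = queue.popleft()  # removes the 0 we just added
last = queue.pop()       # removes the 3 from the right end
print(first, last, list(queue))
```

The memory saving per tuple is small, so it mostly matters when you hold a very large number of small, fixed collections.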

More on this topic:

The Python Standard Library - Python 3.9.0 documentation

While The Python Language Reference describes the exact syntax and semantics of the Python language, this library…

docs.python.org

4. Streamline your code when packaging Notebooks

We all know it ourselves: development notebooks are some of the worst scripts in the world when it comes to readability, scalability and maintainability. There are a few rules of thumb when it comes to packaging your notebooks into a Python package:

Identify repeating code snippets and group them into a function: For example, instead of writing a regular expression for keyword matching every time, I keep a reusable snippet for it.

Remove redundant cells, lines and variables: Yes, we are talking about df_temp = df.copy(), gc.collect(), temp = df.some_column.unique()
Move all your functions and imports to the beginning of the notebook: A good lesson that C taught me when I was at Uni was to define everything you need upfront, so that you never need to worry about where to find it
Untangle and sort cells by sequence of execution: The last thing you want is to have to remember when you jumped from the last cell of the notebook back to the very first
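As an illustration of the first rule, here is a minimal sketch of what such a keyword-matching helper could look like. It is not my original snippet; build_keyword_pattern and its parameters are hypothetical names made up for this example.

```python
import re


def build_keyword_pattern(keywords, ignore_case=True):
    """Compile one regex that matches any of the given keywords as whole words."""
    # Put longer keywords first so "data science" is tried before "data".
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags)


pattern = build_keyword_pattern(["data science", "data", "python"])
matches = pattern.findall("Python is popular in Data Science, and data is everywhere.")
print(matches)
```

Once the pattern construction lives in one function, every notebook cell that needs keyword matching calls it instead of pasting a slightly different regex.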

5. Inherit existing Abstract Classes whenever possible

Object-oriented programming vs functional programming will forever be a debate when it comes to organising your code. But bottom line, using either one is better than using none. When it comes to data science, there are a number of packages that are organised in an OOP manner, e.g. scikit-learn, nltk, spaCy, and more. If you manage to package your script into classes that extend their abstract classes, it will help organise your code better and allow better integration with your data science pipeline.

For example, based on my previous snippet for constructing regular expressions, I would like to construct a tokeniser that returns only the tokens captured by the function. By inheriting nltk’s TokenizerI, my RegexpEntityTokenizer now has the same structure as, say, nltk’s PunktSentenceTokenizer. This means that I can now plug my custom tokenizer into a sklearn Pipeline, for example through a vectorizer’s tokenizer argument, for a seamless workflow.

More on this topic here:

Object-oriented programming for data scientists: Build your ML estimator

Implement some of the core OOP principles in a machine learning context by building your own Scikit-learn-like…

towardsdatascience.com

6. Stop pickling everything

Yes, I know. Pickling DataFrames and other objects is convenient; you can basically serialise everything in your Python script into a .pkl file. However, pickle’s convenience comes with several weak links:

Not designed for speed: Pickle is designed to handle any object, making it slower than more specialised serialisation formats
Not secure: Unpickling a file can execute arbitrary code that has been hidden in the pickle file. Imagine if your ex has some birthday pranks prepared for you.
Not portable: Not all pickled data is compatible between different versions of Python.
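To make the security point concrete, here is a harmless sketch. The Prank class is hypothetical; it only calls print, but the same mechanism would let an attacker name any importable callable.

```python
import pickle


class Prank:
    """Hypothetical malicious object: its pickle tells the loader to call print()."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments"
        # and performs the call when the data is loaded.
        return (print, ("Surprise! This ran while unpickling.",))


payload = pickle.dumps(Prank())

# The call happens inside loads(), before you get a chance to inspect anything.
result = pickle.loads(payload)

# What comes back is whatever the callable returned, not a Prank instance.
print(type(result))
```

This is why you should only unpickle data that comes from a source you trust.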

So remember, pickle is not the only way to save your data and progress. Below are some of the alternatives:

cPickle: On Python 2, look into cPickle, which is pickle implemented in C and can be up to 1,000 times faster than the pure-Python version. On Python 3 there is nothing to do: import pickle already uses the C implementation when it is available.
JSON: For simple structures such as dictionaries and lists, JSON is a fast, human-readable alternative to pickle. On top of that, JSON makes storing and editing dictionaries (or kwargs for configuring your models) possible outside Python in a text editor.
NumPy: If you are serialising a well-defined structure that you can fit in a numpy.ndarray, NumPy’s np.save and np.memmap are some faster options.
Joblib: This should be no stranger to anyone who has been developing machine learning models. If your object contains large NumPy arrays, then joblib may be suitable for you. It has basically the same interface as pickle, only a bit simpler, so you should have no problem using it. However, do note that joblib comes with the same security concern as pickle.
h5py: No stranger to those who have played with, say, TensorFlow or Keras before. HDF5 datasets can hold large amounts of compressed numerical data. It is true that a hierarchical data structure with compressed data means a bit of a learning curve, and it is also a bit harder to query your datasets; but on the flip side, you will get very efficient data I/O from it.
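As a small example of the JSON option, here is a sketch that stores model kwargs in a text file and reads them back. The parameter names and the file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical model configuration; any dict of basic types works.
model_kwargs = {"n_estimators": 200, "max_depth": 8, "learning_rate": 0.05}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "model_config.json"

    # indent=2 keeps the file easy to read and edit by hand
    config_path.write_text(json.dumps(model_kwargs, indent=2))

    loaded_kwargs = json.loads(config_path.read_text())

print(loaded_kwargs)
```

Keep in mind that JSON only covers basic types: tuples come back as lists, and objects such as DataFrames or fitted models need one of the other options.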

7. Consider using a Database

Last but not least, this would be a big jump and could potentially change your game plan. But trust me, having made the switch myself, it is worth it. Check to see if you have experienced any of the following:

You cannot speed up your Python script any further on your huge pandas DataFrame
Saving and loading your dataset is taking longer and longer
All of your data sits in the same place, whether it’s raw, external, or processed
New data keeps coming in and is saved as CSV files

These are all signs that a database will make your life a lot easier. And once you have a database in place, SQL can do a lot of the heavy lifting for you. A lot of the time, I/O is the bottleneck for my Python scripts. By moving the processing just that one step closer to the data, I have seen read-write speed improve by an order of magnitude. Best of all, whatever your requirements are, you will always be able to find a suitable database: MySQL, PostgreSQL, MongoDB, CouchDB, etc.
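To show what moving the processing closer to the data looks like, here is a minimal sketch. I am using SQLite through Python’s built-in sqlite3 module purely because it needs no server; it is a stand-in for whichever database you pick, and the sales table and its values are made up.

```python
import sqlite3

# An in-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 30.0), ("east", 45.0)],
)

# The aggregation runs inside the database; only the small result
# set is transferred to Python, not every row.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```

With a real dataset, the difference is that you pull a handful of aggregated rows into pandas instead of loading the whole table first and grouping it in memory.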

More on this topic here:

Databases 101: Introduction to Databases for Data Scientists

How to get started with the world of databases?

towardsdatascience.com

Before you go

These are some of my Medium blog posts on data science and Python that you may want to check out:

Efficient Conditional Logic on Pandas DataFrames

Time to stop being too dependent on .iterrows() and .apply()

towardsdatascience.com

Essential Jupyter Extension for Data Science Set Up

A custom Jupyter extension that helps organise your project folders

towardsdatascience.com

Efficient Root Searching Algorithms in Python

Implementing efficient searching algorithms for finding roots, and optimisation in Python

towardsdatascience.com

That’s it

Hopefully, you have found these tips useful for polishing your data science endeavours. Let me know in the comments how you find these tricks.

Adios!