Tyler Folkman

Summary

The web content introduces Python's collections module, emphasizing the utility of NamedTuple, Counter, and defaultdict for data scientists to write cleaner, more maintainable code.

Abstract

The article highlights the undervalued Python standard library, particularly the collections module, which offers specialized container datatypes. It underscores the benefits of using NamedTuple for feature engineering, making code more understandable by replacing index-based feature access with named attributes. The Counter class is presented as a powerful tool for counting occurrences, providing quick insights into the frequency of elements. Lastly, defaultdict is recommended for handling dictionary access without errors, simplifying the accumulation of counts or lists for keys. The article encourages readers to leverage these tools to enhance code quality and maintainability.

Opinions

  • The author expresses that data scientists will find NamedTuple extremely useful for organizing features in a human-understandable way, avoiding the confusion of index-based lists.
  • Counter is highly praised for its simplicity and effectiveness in counting tasks, which are common in data science.
  • The author is enthusiastic about defaultdict, highlighting its ability to streamline code by eliminating the need for error-prone dictionary access and initialization.
  • There is a strong recommendation for readers to incorporate these collections tools into their coding practice, suggesting that they will find them indispensable once familiar.
  • The article suggests that using these advanced data structures can significantly improve the readability and elegance of Python code in data science applications.

The Most Undervalued Standard Python Library

Collections for data scientists

Python has a lot of great libraries included out of the box. One of them is collections. The collections module provides “high-performance container datatypes” that serve as alternatives to the general-purpose containers dict, list, set, and tuple. I’d love to introduce you to three of these datatypes, and by the end you’ll be wondering how you ever lived without them.

namedtuple

I can’t overstate how useful namedtuples can be for data scientists. Let me know if this scenario sounds familiar: You are doing feature engineering and since you love lists, you just keep appending the features to a list, which you then feed into your machine learning model. Soon, you might have hundreds of features and that’s when things get messy. You no longer remember which feature refers to which index in your list. Worse, when someone else looks at your code, they have no idea what is going on with this monstrous list of features.

Enter namedtuples to save the day.

With just a few extra lines of code, your messy list will be restored to order. Let’s take a look.
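The snippet embedded in the original post is not reproduced here, so what follows is a minimal sketch of the idea rather than the author's exact code. The type name `Row`, the `height` and `weight` fields, and their values are assumptions made for illustration; only the age of 22 comes from the article.

```python
from collections import namedtuple

# Hypothetical feature row: only `age` (22) is taken from the article,
# the other field names and values are made up for illustration.
Row = namedtuple("Row", ["age", "height", "weight"])

row = Row(age=22, height=185, weight=170)

# Access a feature by name instead of remembering its index.
print(row.age)

# A namedtuple is still a tuple, so index access and conversion
# to a plain list (for example, to feed a model) keep working.
features = list(row)
print(row[0], features)

# The field names travel with the type, which documents the layout.
print(Row._fields)
```

Because a namedtuple is still an ordinary tuple underneath, any code that expects positional access continues to work while new code can use the names.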

If you were to run this code, it would print out “22”, the age you stored in your row. This is amazing! Now you don’t have to use indexes to access your features but instead can use human-understandable names. This makes your code significantly more maintainable and clean.

Counter

Counter is aptly named: its main function is counting. This sounds simple, but it turns out that data scientists often have to count things, so it can be very handy.

There are a few ways it can be initialized, but I most often have a list of values and feed that list in like so:
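The original snippet is missing from this copy of the post, so here is a minimal sketch under stated assumptions. The `ages` list is constructed for illustration so that its counts are consistent with the output the article quotes; it is not the author's data. The sketch also assumes Python 3.7 or later, where `most_common()` lists equal counts in the order the values were first seen.

```python
from collections import Counter

# Illustrative data, built to be consistent with the counts quoted in
# the article; the author's original list is not available.
ages = [22, 22, 22, 22, 22, 25, 25, 25, 24, 24, 30, 30,
        35, 40, 11, 45, 15, 16, 52, 26]

# Passing any iterable to Counter tallies how often each value occurs.
age_counts = Counter(ages)

# most_common() returns (value, count) pairs, highest count first.
print(age_counts.most_common())

# Pass a number to keep only the top entries.
print(age_counts.most_common(1))
```

A `Counter` is a dict subclass, so `age_counts[22]` looks up a single count directly, and asking for a value that never appeared returns 0 instead of raising an error.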

If you were to run the above code (which you can by using this awesome tool), you would get the following output:

[(22, 5), (25, 3), (24, 2), (30, 2), (35, 1), (40, 1), (11, 1), (45, 1), (15, 1), (16, 1), (52, 1), (26, 1)]

The result is a list of tuples ordered from most to least common, where each tuple contains the value first and then its count. So we can now quickly see that 22 is the most common age with 5 occurrences and that there is a long tail of ages that appear only once. Nice!

defaultdict

This is one of my favorites. A defaultdict is a dictionary that fills in a default value the first time each missing key is accessed. Here is an example:
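The article's original snippet and input string are not available in this copy, so this is a minimal sketch of character counting with `defaultdict(int)`. The string `text` and the variable names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Stand-in input; the string used in the original post is not shown.
text = "hello world"

# int() returns 0, so any key seen for the first time starts at 0.
char_counts = defaultdict(int)

for ch in text:
    # No KeyError and no "if ch in char_counts" check is needed.
    char_counts[ch] += 1

print(char_counts)
```

Because the sentence here is a stand-in, the counts it produces will not match the output quoted next, which comes from the article's original input. On Python 3 the printed representation also starts with `defaultdict(<class 'int'>, ...` rather than `<type 'int'>`.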

This returns

defaultdict(<type 'int'>, {'a': 4, ' ': 8, 'c': 1, 'e': 2, 'd': 2, 'f': 2, 'i': 1, 'h': 1, 'l': 1, 'o': 2, 'n': 1, 's': 3, 'r': 2, 'u': 1, 't': 3, 'x': 1})

Normally, when you try to access a key that is not in a dictionary, Python raises a KeyError. There are other ways to handle this, but they add unnecessary code when you have a default value you want to assume. In our example above, we initialize the defaultdict with int. That means on first access to a missing key, it calls int() and stores a zero, so we can easily just keep adding up the counts of all of the characters. Simple and clean. Another common initialization is list, which allows you to immediately start appending values upon first access.
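To make the list initialization concrete, here is a minimal sketch that groups values by key. The `(name, age)` pairs and variable names are hypothetical and exist only to illustrate the pattern.

```python
from collections import defaultdict

# Hypothetical (city, age) records, purely for illustration.
records = [("paris", 22), ("lima", 30), ("paris", 25), ("lima", 22)]

# list() returns [], so each new key starts with an empty list.
ages_by_city = defaultdict(list)

for city, age in records:
    # Append right away; the empty list is created on first access.
    ages_by_city[city].append(age)

# Convert to a plain dict for a cleaner printout.
print(dict(ages_by_city))
```

One thing to keep in mind with this design: merely reading a missing key, as in `ages_by_city["tokyo"]`, also creates it with an empty list, so use `in` or `.get()` when you only want to check for a key.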

Photo by Hitesh Choudhary on Unsplash

Go Write Cleaner Code

Now that you know about the collections library and some of its awesome features, go use them! You’ll be surprised by how often they are useful and how much nicer your code will be. Enjoy!
