Summary
The web content introduces Python's collections
module, emphasizing the utility of NamedTuple
, Counter
, and defaultdict
for data scientists to write cleaner, more maintainable code.
Abstract
The article on the undefined
website highlights the undervalued Python standard library, particularly the collections
module, which offers specialized container datatypes. It underscores the benefits of using NamedTuple
for feature engineering, making code more understandable by replacing index-based feature access with named attributes. The Counter
class is presented as a powerful tool for counting occurrences, providing quick insights into the frequency of elements. Lastly, defaultdict
is recommended for handling dictionary access without errors, simplifying the accumulation of counts or lists for keys. The article encourages readers to leverage these tools to enhance code quality and maintainability.
Opinions
NamedTuple
extremely useful for organizing features in a human-understandable way, avoiding the confusion of index-based lists.Counter
is highly praised for its simplicity and effectiveness in counting tasks, which are common in data science.defaultdict
, highlighting its ability to streamline code by eliminating the need for error-prone dictionary access and initialization.Python has a lot of great libraries included out of the box. One of which is collections. The collections module provides “high-performance container datatypes” which provide alternatives to the general-purpose containers dict, list, set, and tuple. I’d love to introduce you to three of these datatypes and in the end, you’ll be wondering how you ever lived without them.
I can’t overstate how useful namedtuples can be for data scientists. Let me know if this scenario sounds familiar: You are doing feature engineering and since you love lists, you just keep appending the features to a list, which you then feed into your machine learning model. Soon, you might have hundreds of features and that’s when things get messy. You no longer remember which feature refers to which index in your list. Worse, when someone else looks at your code, they have no idea what is going on with this monstrous list of features.
Enter NamedTuples to save the day.
With just a few extra lines of code, your crazy messy list will be restored to order. Let’s take a look
If you were to run this code, it would print out “22”, the age you stored in your row. This is amazing! Now you don’t have to use indexes to access your features but instead can use human-understandable names. This makes your code significantly more maintainable and clean.
Counter is aptly named — its main function is counting. This sounds simple, but it turns out that data scientists often have to count things, so it can be very handy.
There are a few ways it can be initialized, but I most often have a list of values and feed that list in as so
If you were to run the above code (which you can by using this awesome tool), you would get the following output:
[(22, 5), (25, 3), (24, 2), (30, 2), (35, 1), (40, 1), (11, 1), (45, 1), (15, 1), (16, 1), (52, 1), (26, 1)]
A list of tuples ordered by the most common where the tuple first contains the value and then the count. So we can now quickly see that 22 is the most common age with 5 occurrences and that there is a long-tail of ages with only 1 count. Nice!
This is one of my favorites. DefaultDict is a dictionary that is initialized with a default value when each key is encountered for the first time. Here is an example
This returns
defaultdict(<type 'int'>, {'a': 4, ' ': 8, 'c': 1, 'e': 2, 'd': 2, 'f': 2, 'i': 1, 'h': 1, 'l': 1, 'o': 2, 'n': 1, 's': 3, 'r': 2, 'u': 1, 't': 3, 'x': 1})
Normally, when you try and access a value not in a dictionary it throws an error. There are other ways to handle this, but they add unnecessary code when you have a default value you want to assume. In our example above, we initialize the defauldict with int. That means on first access, it will assume a zero, so we can easily just keep adding up the counts of all of the characters. Simple and clean. Another common initialization is list, which allows you to immediately start appending values upon first access.
Now that you know about the collections library and some of its awesome features, go use them! You’ll be surprised by how often they are useful and how much nicer your code will be. Enjoy!
This article can also be found here.
1-page. Well-formatted.
# No cheating pls!!
DSPy Paradigm: Let’s program — not prompt — LLMs
If you are not a member, you can read the full article by clicking here.
Start learning Python easily, fast, and while having fun