Python Pickle: From Whys and Hows to Dos and Don’ts

All about Python pickle module in one place

Source https://jerifink.com/a-pickled-history/

Introduction
Pickling in Python • What is it? • Why do we need it? • How does it work? • Types supported in pickling • Benefits • Drawbacks
Usage of pickle module
Custom pickling • Approach 1: __setstate__ and __getstate__ • Approach 2: __reduce__ and __reduce_ex__ • Which approach to use?
Pickled extras • Python 2 — Python 3 pickle protocols compatibility • Malware execution through pickle loaded data

Introduction

Hey Python enthusiasts! In this article, we’re taking a joyride through Python’s built-in pickle module, exploring its capabilities, benefits, and potential pitfalls.

Pickle, offering a seamless way to serialize and deserialize objects, facilitates communication between different Python processes and plays a crucial role in the Python ecosystem. Its binary serialization format, coupled with cross-version compatibility, makes it a go-to solution for preserving the state of Python objects. However, with great power comes great responsibility, and we’ll also unravel the security considerations and limitations that developers should be mindful of when harnessing Pickle’s capabilities.

Pickling in Python

What is it?

Python’s Pickle module is a powerful and flexible serialization and deserialization library that allows objects to be converted into a byte stream. Serialization is the process of converting complex data structures, such as objects or data in memory, into a format that can be easily stored or transmitted. Pickle provides a way to serialize Python objects into a binary format, and later deserialize them back into their original form.

Why do we need it?

The Python Pickle module serves several important purposes, making it a valuable tool in various scenarios. Here are some key reasons why the Pickle module is useful:

Object Serialization: Pickle allows you to serialize complex Python objects into a binary format, which can be stored persistently in files or databases. This is useful for saving the state of your program or data structures and sharing it across different Python processes, or even between entirely separate Python programs. Pickle can also serialize and deserialize instances of user-defined classes, which makes it versatile for handling a wide range of data structures.
Efficient Data Storage: Pickle’s binary format is often more compact and faster to process than human-readable formats like JSON or XML, particularly for large or deeply nested Python objects, although the difference depends on the data and the protocol used. This makes it well-suited for storing large amounts of data when file size and performance are critical considerations.
Simple Interface: The Pickle module provides a straightforward interface for serializing and deserializing objects. It requires minimal code to store and retrieve complex data structures, making it easy to use for developers.
Object Relationships: Pickle preserves object references during serialization and deserialization. This means that if multiple objects reference the same object, the relationships are maintained in the serialized form.

How does it work?

The Python Pickle module works by serializing (converting to a byte stream) and deserializing (reconstructing from a byte stream) Python objects. The process involves encoding the object’s state into a binary format that can be later reconstructed to its original form. Here’s a simplified overview of how Pickle works:

Serialization (pickling):

Pickle starts by traversing the object to be serialized, whether it’s a simple data type like a string or number, or a complex data structure like a list or dictionary. If the object being serialized contains other objects (e.g., a list with nested dictionaries), Pickle recursively serializes those objects, ensuring the entire structure is captured in the byte stream.
As Pickle traverses the object, it constructs a byte stream representation of the object’s state. This byte stream contains information about the object’s type, attributes, and the relationships between different objects. During the process, Pickle keeps track of object references to ensure that if an object is referenced multiple times, it is serialized only once. This helps preserve relationships and avoids redundant data.
Pickle has mechanisms to handle custom classes by serializing the class name and its attributes. Custom classes need to implement special methods like __getstate__() and __setstate__() if customization is required during the serialization process.
The final result is a binary representation of the object’s state, stored in a byte stream. This byte stream can be saved to a file, sent over a network, or stored in a database.
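The byte stream described above can be inspected with the standard library's pickletools module. The following is a minimal sketch (the sample dictionary and variable names are made up for illustration) that disassembles a pickle into its opcodes without executing it:

```python
import io
import pickle
import pickletools

# Hypothetical sample data, chosen only for illustration
data = {"name": "Alice", "grades": [95, 87]}
payload = pickle.dumps(data, protocol=4)

# pickletools.dis writes a human-readable opcode listing to a text stream
buffer = io.StringIO()
pickletools.dis(payload, out=buffer)
listing = buffer.getvalue()

print(listing)
```

The listing starts with a PROTO opcode that records the protocol version and ends with STOP. The MEMOIZE entries in between are how pickle remembers objects it has already written.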

Deserialization (unpickling):

To reconstruct the object, Pickle reads the byte stream, interpreting the binary data to understand the structure and content of the serialized data. Pickle reconstructs the object based on the information stored in the byte stream. This involves creating instances of classes, setting attribute values, and building data structures.
If the serialized data includes references to other objects, Pickle recursively deserializes those objects to reconstruct the complete structure.
For custom classes, Pickle uses the class name and attribute information to instantiate the class and set its state. Custom classes can implement __reduce__() to provide additional control over the deserialization process.
Pickle ensures that references between objects are restored correctly during deserialization. If an object is referenced multiple times, the deserialization process ensures that all references point to the same reconstructed object.
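To make the point about references concrete, here is a small sketch (all names are illustrative) showing that shared and self-referencing objects keep their relationships after a round trip:

```python
import pickle

# Two dictionary keys pointing at the very same list object
shared = ["math", "history"]
record = {"first": shared, "second": shared}

# A structure that contains a reference to itself
node = {"name": "root"}
node["self"] = node

restored = pickle.loads(pickle.dumps(record))
restored_node = pickle.loads(pickle.dumps(node))

# Both keys still point at one single list after unpickling
same_list = restored["first"] is restored["second"]
# The cycle is rebuilt as well
is_cycle = restored_node["self"] is restored_node
# The restored list is a new object, not an alias of the original
is_copy = restored["first"] is not shared

print(same_list, is_cycle, is_copy)
```

Note that identity is preserved only within a single dumps/loads call. Pickling the same object in two separate calls produces two independent copies.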

Types supported in pickling

Python’s Pickle module supports serialization and deserialization of most built-in types, but not all of them. Here’s a list of common built-in types, with a note wherever pickling is limited or not supported:

int: Integer type of arbitrary size (Python has no separate unsigned integer type).
float: Floating-point type for decimal numbers.
complex: Complex number type.
str: String type for Unicode text.
bytes: Immutable sequence of bytes.
bytearray: Mutable sequence of bytes.
bool: Boolean type, representing either True or False.
NoneType: Type of the None object.
list: Mutable sequence, represented by square brackets.
tuple: Immutable sequence, represented by parentheses.
range: Represents an immutable sequence of numbers.
set: Unordered, mutable set of unique elements.
frozenset: Immutable set of unique elements.
dict: Mutable mapping of keys to values, represented by curly braces.
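memoryview: Memory view object. Cannot be pickled directly; convert it to bytes first.
slice: Represents a slice object used for slicing sequences.
iter: Iterator object. Some iterators over built-in containers can be pickled in CPython, but generators cannot.
function: Function object. Pickled by reference (module and qualified name), not by value, so only functions defined at the top level of an importable module work; lambdas do not.
builtin_function_or_method: Built-in function or method object, also pickled by reference.
module: Module object. Cannot be pickled.
type: Type object. Classes are pickled by reference, like functions.
code: Code object. Cannot be pickled.
file: File object. Cannot be pickled.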
datetime.date: Represents a date (year, month, day).
datetime.time: Represents a time (hour, minute, second, microsecond).
datetime.datetime: Represents both date and time.
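Instances of custom classes can also be pickled by default, as long as the class is defined at the top level of a module and the instance attributes are themselves picklable. Special methods are only needed to customize the process (more on that in a bit).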
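Not every object can be pickled. The sketch below probes a handful of objects and records which ones pickle.dumps accepts. The helper name try_pickle and the chosen samples are illustrative assumptions, not part of the pickle API:

```python
import pickle
import sys

def try_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError) as exc:
        print(f"Cannot pickle {type(obj).__name__}: {exc}")
        return False

results = {
    "dict": try_pickle({"math": 95}),
    "range": try_pickle(range(10)),
    "built-in function": try_pickle(len),        # stored by reference
    "lambda": try_pickle(lambda x: x),           # no importable name
    "module": try_pickle(sys),
    "generator": try_pickle(n for n in range(3)),
    "memoryview": try_pickle(memoryview(b"abc")),
    "code": try_pickle(compile("1 + 1", "<demo>", "eval")),
}

for label, ok in results.items():
    print(f"{label}: {'picklable' if ok else 'not picklable'}")
```

Plain data containers and objects that can be looked up by name go through, while lambdas, modules, generators, memory views, and code objects are rejected.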

Benefits

The Python Pickle module offers several benefits, making it a valuable tool for serialization and deserialization of Python objects. Here are the key advantages of using Pickle:

Serialization: Pickle can serialize and deserialize complex data structures, including nested objects, custom classes, and instances of built-in classes. This versatility makes it suitable for serializing diverse and complex data structures.
Custom Class Support: Pickle supports the serialization and deserialization of custom Python classes by allowing them to implement special methods such as __reduce__. This enables developers to control the serialization process for custom objects.
Binary Format: Pickle uses a binary format for serialization, which is often more compact and efficient than human-readable formats like JSON or XML, depending on the data and the protocol. This makes Pickle well-suited for storing and transmitting large amounts of data, reducing storage space and network bandwidth.
Cross-Platform Compatibility: Pickle’s binary format is platform-independent, allowing serialized data to be moved across different platforms and operating systems seamlessly. This makes it suitable for scenarios where data needs to be exchanged between systems with varying architectures.
Cross-Version Compatibility: Pickle protocols are backward compatible, so a newer Python version can read pickles written by an older one. The reverse only works if the pickle was written with a protocol that the older version supports. This feature is particularly useful in multi-version Python environments.
Object Relationships Preservation: Pickle preserves relationships between objects during serialization and deserialization. If multiple objects reference the same object, Pickle ensures that these references are maintained in the serialized form, allowing for accurate reconstruction of the object graph. Also, Pickle efficiently handles references to objects, ensuring that an object referenced multiple times is serialized only once. This helps reduce redundancy in the serialized data.
Object State Preservation: Pickle captures and preserves the internal state of Python objects during serialization. When objects are deserialized, their state is accurately reconstructed, ensuring that the behavior of the program remains consistent.

Drawbacks

While the Python Pickle module offers numerous benefits, it also comes with certain drawbacks and considerations that users should be aware of. Here are the key drawbacks of using Pickle:

Security Risks: One of the significant drawbacks of Pickle is its potential security risks. Unpickling data from untrusted or unauthenticated sources can lead to the execution of arbitrary code. This makes Pickle unsuitable for handling data from untrusted or insecure sources.
Compatibility Issues: While Pickle is designed to be compatible across different Python versions, there may still be challenges when dealing with objects that have undergone changes between Python releases. This can result in compatibility issues, especially when serializing and deserializing objects across different versions.
Limited Interoperability: Pickle’s binary format is specific to Python, limiting its interoperability with other programming languages. If data needs to be exchanged between Python and non-Python systems, using a more standardized format like JSON or XML might be a better choice.
Version Dependency: Although Pickle aims for cross-version compatibility, there could be scenarios where new features or changes introduced in a later Python version may not be supported in an older version. This version dependency can create challenges in certain environments.
Lack of Human Readability: Pickle’s binary format is not human-readable, making it challenging to inspect the contents of a serialized file without using Pickle itself. This lack of readability can be a drawback in scenarios where human inspection of the data is necessary.
Limited Support for External Resources: Pickle is primarily designed for serializing Python objects and their internal state. It may not handle well objects that rely on external resources, such as file handles or network connections. Objects requiring special handling during serialization may need additional customization.
Potential for Large Serialized Files: Pickle stores type and structure information alongside the data, so for small or simple payloads the output can be larger than an equivalent JSON document, especially with the text-based protocol 0. This could be a consideration in scenarios where minimizing storage space is critical.
Performance Considerations: Pickle may not always be the fastest serialization option. Depending on the use case and performance requirements, other serialization formats or libraries might be more suitable.
Not Suitable for All Data Types: While Pickle supports a wide range of data types, certain types (e.g., file objects, database connections) may not be easily pickled, and handling them requires additional care.
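To make the security drawback more tangible, here is a minimal sketch of a restricted unpickler, following the pattern described in the official pickle documentation: it overrides Unpickler.find_class so that only an allow-list of globals can be loaded. The allow-list contents and the helper name restricted_loads are assumptions made for this example, and this approach reduces the risk rather than making untrusted pickles safe.

```python
import builtins
import io
import pickle

# Assumed allow-list for this example; adjust to your own data
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only hand out a small set of harmless built-in types
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(payload):
    """Like pickle.loads, but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(payload)).load()

# Plain data and allow-listed types load normally
ok = restricted_loads(pickle.dumps([1, 2, range(3)]))

# A pickle that references builtins.print is rejected
try:
    restricted_loads(pickle.dumps(print))
    blocked = False
except pickle.UnpicklingError as exc:
    print(f"Rejected: {exc}")
    blocked = True
```

For data that crosses a trust boundary, a format that cannot carry code references, such as JSON, is usually the simpler choice.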

Ok, so enough with the theory, right? Let’s see some practical examples!

Usage of pickle module

First let’s see how we can use the pickle module to store and load standard built-in Python objects.

import pickle

# Sample data to be serialized
student_data = {
    'name': 'Alice',
    'age': 22,
    'grades': {'math': 95, 'history': 87, 'english': 91}
}

# Serialize the data using pickle.dumps()
serialized_data = pickle.dumps(student_data)

# Save the serialized data to a file
with open('student_data.pkl', 'wb') as file:
    file.write(serialized_data)

# Read the serialized data from the file
with open('student_data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)

# Display the original and the loaded data
print("Original Data:")
print(student_data)

print("\nLoaded Data:")
print(loaded_data)

So what happens in that script?

We create a sample dictionary (student_data) representing information about a student.
The pickle.dumps() function is used to serialize the data into a binary format.
The serialized data is then saved to a file named ‘student_data.pkl’ using the open() function in binary write mode ('wb').
Next, we read the serialized data from the file using the open() function in binary read mode ('rb').
The pickle.load() function is used to deserialize the data, reconstructing the original Python object.
Finally, we print both the original data and the loaded data to compare them.

The above script prints out the following:

➜  example python3.11 pickle_demo.py 
Original Data:
{'name': 'Alice', 'age': 22, 'grades': {'math': 95, 'history': 87, 'english': 91}}

Loaded Data:
{'name': 'Alice', 'age': 22, 'grades': {'math': 95, 'history': 87, 'english': 91}}

So, we used the pickle module to store a Python dictionary in a pickle file and then loaded the file to get the information back and print it. It worked: the two printouts are identical.
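As a side note, the dumps-then-write step can be shortened with pickle.dump, which writes straight into an open binary file. The sketch below is a variation of the same example; the temporary directory is only used here so the demo does not leave files behind:

```python
import os
import pickle
import tempfile

student_data = {
    'name': 'Alice',
    'age': 22,
    'grades': {'math': 95, 'history': 87, 'english': 91}
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'student_data.pkl')

    # pickle.dump serializes and writes to the file in one call
    with open(path, 'wb') as file:
        pickle.dump(student_data, file)

    # pickle.load reads and deserializes from the file in one call
    with open(path, 'rb') as file:
        loaded_data = pickle.load(file)

print(loaded_data == student_data)
```

The pairs are symmetric: dumps/loads work with bytes in memory, while dump/load work with file objects.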

So what does the pickle file look like if we open it in an editor?

Mostly unreadable bytes, as expected.

As mentioned earlier, the data is stored in a binary format, which makes it practically unreadable for humans. We can still spot some words that are saved intact, but the file as a whole is not readable.

Let’s see the same example, but this time with a custom class instance for pickle storage.

import pickle

# Define a custom class
class Student:
    def __init__(self, name, age, grades):
        self.name = name
        self.age = age
        self.grades = grades

    def display_info(self):
        print(f"Name: {self.name}, Age: {self.age}")
        print("Grades:")
        for subject, grade in self.grades.items():
            print(f"  {subject}: {grade}")

# Create an instance of the custom class
alice = Student(name='Alice', age=22, grades={'math': 95, 'history': 87, 'english': 91})

# Serialize the custom class instance
serialized_instance = pickle.dumps(alice)

# Save the serialized instance to a file
with open('student_instance.pkl', 'wb') as file:
    file.write(serialized_instance)

# Read the serialized instance from the file
with open('student_instance.pkl', 'rb') as file:
    loaded_instance = pickle.load(file)

# Display information from the loaded instance
print("Original Instance:")
alice.display_info()
print(f"Type of original instance is {type(alice)}")

print("\nLoaded Instance:")
loaded_instance.display_info()
print(f"Type of loaded instance is {type(loaded_instance)}")

The concept here is the same, but instead of a mere python dictionary we are using a custom class:

We define a custom class Student with attributes for the student's name, age, and grades. The class includes a method display_info() to print the student's information.
An instance of the custom class (alice) is created with sample data.
The pickle.dumps() function is used to serialize the custom class instance.
The serialized instance is saved to a file named ‘student_instance.pkl’ using the open() function in binary write mode ('wb').
Next, we read the serialized instance from the file using the open() function in binary read mode ('rb').
The pickle.load() function is used to deserialize the data, reconstructing the original custom class instance.
Finally, we display information from both the original instance and the loaded instance to ensure successful pickling and unpickling of the custom class.

The above script prints out the following:

➜  example python3.11 pickle_demo.py
Original Instance:
Name: Alice, Age: 22
Grades:
  math: 95
  history: 87
  english: 91
Type of original instance is <class '__main__.Student'>

Loaded Instance:
Name: Alice, Age: 22
Grades:
  math: 95
  history: 87
  english: 91
Type of loaded instance is <class '__main__.Student'>

So both printouts are identical, showing that the data of the custom class instance is saved intact and that all the types involved are preserved. Cool, huh?

Custom pickling

Ok, so far, we have seen how to use the pickle module to dump and load Python objects back and forth. But we have let the module and its pre-defined methods do the heavy lifting for us.

Now, let’s see how we can build our own custom pickling process that enables us to store and load pretty much everything we like, just the way we like it!

Approach 1: __setstate__ and __getstate__

The __getstate__ and __setstate__ methods provide a way to control what data is stored during pickling and how the object is reconstructed during unpickling.

The __getstate__ method is called when an object is about to be pickled. It should return the object’s state, typically as a dictionary. The keys of this dictionary represent the attributes or information that you want to pickle, and the values are the corresponding values of those attributes.
The __setstate__ method is called when an object is being unpickled. It receives the dictionary returned by __getstate__ and should be used to set the object's state.

Basically, you can think of that approach like this: Whatever comes out of __getstate__ method during pickling, is received by __setstate__ method as an argument during unpickling. So the return value of the __getstate__ is the receiving argument of __setstate__ .

Let’s see a practical example:

from datetime import datetime
import pickle

class Student:
    def __init__(self, name, age, grades):
        self.name = name
        self.age = age
        self.grades = grades

    def __getstate__(self):
        # Return a dictionary representing the object's state
        pickling_timestamp = datetime.utcnow().isoformat()
        print(f"Pickling object at {pickling_timestamp}")
        return {
            'name': self.name,
            'age': self.age,
            'grades': self.grades,
            'custom_pickle_timestamp': pickling_timestamp
        }

    def __setstate__(self, state):
        # Set the object's state based on the provided dictionary
        self.name = state['name']
        self.age = state['age']
        self.grades = state['grades']
        print(f"Restoring pickled object state. Object was pickled at: {state['custom_pickle_timestamp']}")

    def display_info(self):
        print(f"Name: {self.name}, Age: {self.age}")
        print("Grades:")
        for subject, grade in self.grades.items():
            print(f"  {subject}: {grade}")

# Create an instance of the custom class
alice = Student(name='Alice', age=22, grades={'math': 95, 'history': 87, 'english': 91})

# Serialize the custom class instance
serialized_instance = pickle.dumps(alice)

# Deserialize the custom class instance
loaded_instance = pickle.loads(serialized_instance)

# Display information from the original and loaded instances
print("Original Instance:")
alice.display_info()

print("\nLoaded Instance:")
loaded_instance.display_info()

Explanation:

Just like previous examples, we defined the very same custom class for students, with the same attributes and display_info method.
This time we defined the methods __setstate__ and __getstate__ , hence overriding the default process of the pickle module.
During serialization (pickling), meaning when pickle.dumps method is being called, the __getstate__ method is called. In our definition, we return a dictionary with all the instance attributes, plus one extra key-value pair with the current timestamp, just to showcase the timestamp of pickling process, which we also print out in the console. This dictionary returned, will be used during deserialization (unpickling) to restore the information of the pickled object.
During deserialization (unpickling), meaning when pickle.loads method is being called, the __setstate__ method is called. In our definition, the method is receiving a dictionary with the state of the object stored. This very information is used to reconstruct the object in its initial form, so setting the instance’s attributes etc. During the process we are also printing out the custom pickling timestamp that we stored in the object just to prove that the information is there in the pickled object, even though we are not using it further than logging.

This script will print out:

➜  example python3.11 pickle_demo.py
Pickling object at 2024-01-02T12:16:42.216102
Restoring pickled object state. Object was pickled at: 2024-01-02T12:16:42.216102
Original Instance:
Name: Alice, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

Loaded Instance:
Name: Alice, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

As expected, the information stayed intact during the process. Plus, we can see the logs coming from the defined methods during the pickle module methods execution, along with the saved timestamp of the process.
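A frequent practical use of this pair of methods, beyond logging, is leaving out an attribute that cannot be pickled and re-creating it on load. The sketch below assumes a hypothetical GradeBook class that holds a threading.Lock, which pickle refuses to serialize:

```python
import pickle
import threading

class GradeBook:
    def __init__(self, owner):
        self.owner = owner
        self.grades = {}
        self._lock = threading.Lock()  # locks cannot be pickled

    def add(self, subject, grade):
        with self._lock:
            self.grades[subject] = grade

    def __getstate__(self):
        # Copy the instance dictionary and drop the unpicklable lock
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and create a fresh lock
        self.__dict__.update(state)
        self._lock = threading.Lock()

book = GradeBook('Alice')
book.add('math', 95)

restored = pickle.loads(pickle.dumps(book))
restored.add('history', 87)  # works because the lock was re-created

print(restored.owner, restored.grades)
```

Without the two methods, pickle.dumps(book) would fail with a TypeError because of the lock attribute.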

Approach 2: __reduce__ and __reduce_ex__

Another approach to custom pickling is to override the method __reduce__ and, optionally, the method __reduce_ex__.

The __reduce__ method should return a tuple containing a callable (a function or a class) and a tuple of arguments. When the object is unpickled, the callable is called with those arguments to reconstruct the object.
__reduce_ex__ is an extended version of __reduce__ that additionally receives the protocol version being used for serialization. This method is optional: when a class defines it, pickle calls it in preference to __reduce__, and when it does not, __reduce__ is used.

import pickle

class Student:
    def __init__(self, name, age, grades):
        self.name = name
        self.age = age
        self.grades = grades

    def __reduce__(self):
        # Return a tuple with a callable and its arguments for pickling
        print("Pickling object with custom __reduce__ method")
        return (self.__class__, (f"tweaked {self.name}", self.age + 10, self.grades))


    def display_info(self):
        # Method to display student information
        print(f"Name: {self.name}, Age: {self.age}")
        print("Grades:")
        for subject, grade in self.grades.items():
            print(f"  {subject}: {grade}")

# Create an instance of the custom class
alice = Student(name='Alice', age=22, grades={'math': 95, 'history': 87, 'english': 91})

# Serialize the custom class instance
serialized_instance = pickle.dumps(alice)

# Deserialize the custom class instance
loaded_instance = pickle.loads(serialized_instance)

# Display information from the original and loaded instances
print("Original Instance:")
alice.display_info()

print("\nLoaded Instance:")
loaded_instance.display_info()

Explanation:

Just like previous examples, we defined the very same custom class for students, with the same attributes and display_info method.
This time we defined the method __reduce__, hence overriding the default process of the pickling process.
During serialization (pickling), meaning when pickle.dumps method is being called, the __reduce__ method is called. In our definition, we return a callable, the class itself, with all the instance attributes necessary to reconstruct the instance using the callable, hence the __init__ method of the passed class callable. In our case we tweaked a bit the instance attributes for the sake of showcasing the customization.
During deserialization (unpickling), meaning when pickle.loads method is being called, the callable class is called. So, the __init__ method of the class is called with the arguments stored in the pickle. In our case, the tweaked stored arguments are retrieved from the pickled object and passed into the __init__ method to construct a new instance.

This script will print out:

➜  example python3.11 pickle_demo.py
Pickling object with custom __reduce__ method
Original Instance:
Name: Alice, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

Loaded Instance:
Name: tweaked Alice, Age: 32
Grades:
  math: 95
  history: 87
  english: 91

As expected, the __reduce__ method that we defined was run during the pickling process and the tweaked instance attributes were stored in the pickle. During unpickling, the tweaked attributes were used to reconstruct a student instance, which printed out as expected.

The very same example can be further customized using the __reduce_ex__ method. This method receives the pickle protocol in use as an extra argument, which lets us adapt the result to the protocol. Let’s see how we can use that to differentiate the pickle process:

import pickle

class Student:
    def __init__(self, name, age, grades):
        self.name = name
        self.age = age
        self.grades = grades

    def __reduce__v2(self):
        # Return a tuple with a callable and its arguments for pickling
        print(f"Pickling object with custom __reduce___v2 method")
        return (self.__class__, (f"{self.name} v2", self.age, self.grades))

    def __reduce__v3(self):
        # Return a tuple with a callable and its arguments for pickling
        print(f"Pickling object with custom __reduce__v3 method")
        return (self.__class__, (f"{self.name} v3", self.age, self.grades))

    def __reduce__v4(self):
        # Return a tuple with a callable and its arguments for pickling
        print(f"Pickling object with custom __reduce__v4 method")
        return (self.__class__, (f"{self.name} v4", self.age, self.grades))

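    def __reduce_ex__(self, protocol):
        # __reduce_ex__ is an extended version with protocol argument
        print(f"Pickling object with custom __reduce_ex__ method using protocol {protocol}")
        if protocol == 2:
            return self.__reduce__v2()
        elif protocol == 3:
            return self.__reduce__v3()
        elif protocol == 4:
            return self.__reduce__v4()
        # Any other protocol (0, 1, 5, ...) falls back to the default behaviour
        # instead of returning None, which would make pickling fail
        return super().__reduce_ex__(protocol)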


    def display_info(self):
        # Method to display student information
        print(f"Name: {self.name}, Age: {self.age}")
        print("Grades:")
        for subject, grade in self.grades.items():
            print(f"  {subject}: {grade}")

# Create an instance of the custom class
alice = Student(name='Alice', age=22, grades={'math': 95, 'history': 87, 'english': 91})

# Serialize the custom class instance
serialized_instance_v2 = pickle.dumps(alice, protocol=2)
serialized_instance_v3 = pickle.dumps(alice, protocol=3)
serialized_instance_v4 = pickle.dumps(alice, protocol=4)

# Deserialize the custom class instance
loaded_instance_v2 = pickle.loads(serialized_instance_v2)
loaded_instance_v3 = pickle.loads(serialized_instance_v3)
loaded_instance_v4 = pickle.loads(serialized_instance_v4)

# Display information from the original and loaded instances
print("Original Instance:")
alice.display_info()

print("\nLoaded Instance v2:")
loaded_instance_v2.display_info()

print("\nLoaded Instance v3:")
loaded_instance_v3.display_info()

print("\nLoaded Instance v4:")
loaded_instance_v4.display_info()

Explanation:

Just like previous examples, we defined the very same custom class for students, with the same attributes and display_info method.
This time we defined three versions of the method __reduce__, each with a related version name.
We also defined an override for the method __reduce_ex__ which receives a protocol parameter. Based on the protocol argument we are calling the related __reduce__v{protocol} method to pickle the instance.
During serialization (pickling), meaning when the pickle.dumps method is being called, the __reduce_ex__ method is called and, based on the protocol specified, the dedicated reduce function is called subsequently. This time we pass an extra argument to the dumps method to specify the protocol version we want for the pickling process. Each version tweaks the name attribute of the instance in a different way for the sake of showcasing the customization.
During deserialization (unpickling), meaning when the pickle.loads method is being called, the callable class is called. So, the __init__ method of the class is called with the arguments stored in the pickle. In our case, the tweaked stored arguments are retrieved from the pickled objects and passed into the __init__ method to construct new instances.

This script will print out:

➜  example python3.11 pickle_demo.py
Pickling object with custom __reduce_ex__ method using protocol 2
Pickling object with custom __reduce___v2 method
Pickling object with custom __reduce_ex__ method using protocol 3
Pickling object with custom __reduce__v3 method
Pickling object with custom __reduce_ex__ method using protocol 4
Pickling object with custom __reduce__v4 method
Original Instance:
Name: Alice, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

Loaded Instance v2:
Name: Alice v2, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

Loaded Instance v3:
Name: Alice v3, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

Loaded Instance v4:
Name: Alice v4, Age: 22
Grades:
  math: 95
  history: 87
  english: 91

Which approach to use?

The __getstate__/__setstate__ methods and the __reduce__/__reduce_ex__ methods can technically be used together in one class: a state returned as the third item of the __reduce__ tuple is even handed to __setstate__ during unpickling. In practice, though, they serve similar purposes and one approach is usually sufficient. Combining them tends to add redundancy and unnecessary complexity, so it’s generally advisable to stick with the approach that best fits your requirements.

But which one to choose? It depends on what you really need:

If you only need control over which attributes get serialized and how they are restored, then stick with the first approach and the methods __getstate__ and __setstate__.
If you need control over how the object itself is re-created, for example which callable is invoked and with which arguments, then choose the second approach with the methods __reduce__ and/or __reduce_ex__.
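For completeness, here is a minimal sketch of the two approaches working together. It relies on the documented optional third item of the __reduce__ tuple, the object's state, which pickle passes to __setstate__ after calling the callable. The notes attribute is a hypothetical addition made for this example:

```python
import pickle

class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.notes = []  # extra state that is not a constructor argument

    def __reduce__(self):
        # (callable, constructor arguments, state for __setstate__)
        return (self.__class__, (self.name, self.age), {'notes': self.notes})

    def __setstate__(self, state):
        # Called after Student(name, age) has been re-created
        self.notes = list(state['notes'])

alice = Student('Alice', 22)
alice.notes.append('transferred in 2023')

restored = pickle.loads(pickle.dumps(alice))
print(restored.name, restored.age, restored.notes)
```

The constructor arguments travel through __reduce__, while everything that __init__ does not accept travels through the state.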

Pickled extras

Python 2 — Python 3 pickle protocols compatibility

As explained earlier, the Python Pickle module supports several versions of its serialization format, called protocols. This is not just a feature exposed for custom pickle processes; it is what ensures compatibility across versions of Python and the Pickle module. That said, the choice of protocol can affect compatibility between different Python versions (Python 2 and Python 3), because the protocol version determines the format and features supported during pickling and unpickling.

Python 2

In Python 2, the default protocol is 0, and the maximum supported protocol is 2.

Python 3

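Python 3.0 introduced protocol 3, which adds explicit support for bytes objects and cannot be unpickled by Python 2. Binary pickling was not new at that point: protocols 1 and 2 are already binary, and only protocol 0 is text-based. Protocol 4 followed in Python 3.4 and protocol 5 in Python 3.8.

In Python 3, the default protocol was 3 in Python 3.0 to 3.7 and became 4 in Python 3.8 (later releases may raise it again). The maximum supported protocol is the latest version available in that Python release.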
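Rather than memorizing these numbers, you can ask the running interpreter. The sketch below prints the module constants pickle.DEFAULT_PROTOCOL and pickle.HIGHEST_PROTOCOL and pickles a small, made-up dictionary with every protocol the interpreter supports:

```python
import pickle

data = {'name': 'Alice', 'age': 22}

print(f"Default protocol: {pickle.DEFAULT_PROTOCOL}")
print(f"Highest protocol: {pickle.HIGHEST_PROTOCOL}")

# Pickle the same object with every supported protocol
payloads = {
    protocol: pickle.dumps(data, protocol=protocol)
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

for protocol, payload in payloads.items():
    print(f"protocol {protocol}: {len(payload)} bytes")
```

The printed values depend on the Python release running the script, so no fixed output is shown here.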

Cross-Version Compatibility

If you are exchanging pickled data between Python 2 and Python 3, choosing a protocol that is supported by both versions is crucial.

If you are pickling data in Python 3 and need compatibility with Python 2, it’s generally advisable to use protocol 2, which is the highest protocol supported by Python 2. This ensures that the pickled data can be successfully loaded by both Python 2 and Python 3.

Let’s see that with some examples.

Not compatible pickle protocols

Let’s use Python 3 to store a pickled object and Python 2 to read it back while not specifying a pickle protocol.

import pickle

# Sample data to be serialized
student_data = {
    'name': 'Alice',
    'age': 22,
    'grades': {'math': 95, 'history': 87, 'english': 91}
}

# Serialize the data using pickle.dumps()
serialized_data = pickle.dumps(student_data)

# Save the serialized data to a file
with open('student_data.pkl', 'wb') as file:
    file.write(serialized_data)

We execute the above script with Python 3 to store the pickled object.

Then we read the file and try to load the pickled object with Python 2, using this script:

import pickle

# Read the serialized data from the file
with open('student_data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)


print("\nLoaded Data:")
print(loaded_data)

This will print out:

➜  example python3.11 pickle_demo.py
➜  example python2.7 pickle_read.py
Traceback (most recent call last):
  File "pickle_read.py", line 5, in <module>
    loaded_data = pickle.load(file)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 892, in load_proto
    raise ValueError, "unsupported pickle protocol: %d" % proto
ValueError: unsupported pickle protocol: 4

Pickling the object with the Python 3 runtime used the default protocol for that version, which for Python 3.11 is protocol 4. When we tried to read the very same file and unpickle it with Python 2, the runtime raised an error, since Python 2 only supports pickle protocols up to version 2.
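The protocol number in that error message comes from the start of the file. Pickles written with protocol 2 or higher begin with a PROTO opcode (byte 0x80) followed by the protocol number, so you can check the version without unpickling anything. The sketch below targets Python 3, and detect_protocol is a hypothetical helper name used for illustration, not part of the pickle module.

```python
import pickle

def detect_protocol(payload):
    """Return the protocol declared in the pickle header, or None.

    Protocols 0 and 1 write no header, so None is returned for them.
    (Illustrative helper, not a pickle API.)
    """
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    return None

student_data = {'name': 'Alice', 'age': 22}

print(detect_protocol(pickle.dumps(student_data, protocol=4)))  # 4
print(detect_protocol(pickle.dumps(student_data, protocol=2)))  # 2
print(detect_protocol(pickle.dumps(student_data, protocol=0)))  # None
```

A check like this only tells you which format the bytes claim to be in; it says nothing about whether the content is safe to load.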

Compatible pickle protocols

Let’s now try the same, but this time specify a protocol while pickling our data in Python 3:

import pickle

# Sample data to be serialized
student_data = {
    'name': 'Alice',
    'age': 22,
    'grades': {'math': 95, 'history': 87, 'english': 91}
}

# Serialize the data using pickle.dumps()
serialized_data = pickle.dumps(student_data, protocol=2)

# Save the serialized data to a file
with open('student_data.pkl', 'wb') as file:
    file.write(serialized_data)

We make no changes in the script that reads the pickled object and execute both scripts:

➜  example python3.11 pickle_demo.py
➜  example python2.7 pickle_read.py 

Loaded Data:
{u'age': 22, u'grades': {u'english': 91, u'math': 95, u'history': 87}, u'name': u'Alice'}

As expected, there is no error this time. When pickling, we used protocol 2, which is the highest protocol supported by the Python 2 pickle module. Hence, when Python 2 read the file, it understood the format and unpickled the object correctly. Note the u'' prefixes in the output: a Python 3 str comes back as a unicode object in Python 2.
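The opposite direction, reading a Python 2 pickle in Python 3, has its own catch: 8-bit Python 2 strings are decoded as ASCII by default. The encoding argument of pickle.loads controls this. In the sketch below, py2_payload is a hand-written byte string in the protocol 2 layout, assumed here for illustration rather than captured from a real Python 2 run.

```python
import pickle

# Hand-written protocol 2 bytes for the Python 2 str '\xc3\xa9'
# (the UTF-8 encoding of "é"). Assumption: typed by hand for this example.
py2_payload = b'\x80\x02U\x02\xc3\xa9q\x00.'

# The default encoding is ASCII, which cannot decode the byte 0xc3
try:
    pickle.loads(py2_payload)
except UnicodeDecodeError as exc:
    print("default decoding failed:", exc.reason)

# Decode Python 2 strings as UTF-8 text...
as_text = pickle.loads(py2_payload, encoding="utf-8")
# ...or keep them as raw bytes objects
as_bytes = pickle.loads(py2_payload, encoding="bytes")

print(ascii(as_text))
print(as_bytes)
```

Which encoding is right depends on what the Python 2 program stored in its strings, so treat "utf-8" here as one possible choice, not a universal answer.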

Malware execution through pickle loaded data

As already stated in the drawbacks section, unpickling objects in Python can pose security risks when loading data from untrusted sources. The security concern arises from the fact that the pickle module in Python is a powerful serialization tool that can execute arbitrary code during unpickling. If an attacker can provide maliciously crafted pickled data, it may lead to code execution on the system.

Let’s explore a simple example to illustrate the security risk.

Let’s assume that an untrusted party pickles an instance of the following class and hands us the resulting bytes, which we then try to deserialize.

import os, pickle

class MaliciousCode:
    def __reduce__(self):
        # Called when the object is pickled; the callable and arguments it
        # returns are stored in the pickle and invoked during unpickling
        return (os.system, ("echo Malicious code executed!",))

# Serialize the object
serialized_data = pickle.dumps(MaliciousCode())

# Deserialize the object (malicious code gets executed)
deserialized_object = pickle.loads(serialized_data)

What do we expect to happen here:

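The class instance is pickled as usual. While doing so, pickle calls the instance’s __reduce__ method and stores what it returns in the byte stream: a callable, in this case the system function of the os module, and the arguments for that callable, in this case a simple echo command.
During deserialization (unpickling with pickle.loads), pickle calls the stored callable with the stored arguments. The MaliciousCode class itself does not need to exist on the machine that unpickles the data.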
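The two steps above can be observed with a harmless callable. This is a sketch with made-up names (Tracked and calls are illustrative only): __reduce__ runs once, at pickling time, and unpickling simply calls the stored function, even after the class has been deleted.

```python
import pickle

calls = []  # records every time __reduce__ runs

class Tracked:
    def __reduce__(self):
        calls.append("__reduce__ called")
        # Harmless stand-in for os.system: sort a list during unpickling
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Tracked())
print(calls)   # ['__reduce__ called']

# The pickle only refers to builtins.sorted, so the class is not needed
del Tracked

result = pickle.loads(payload)
print(calls)   # still ['__reduce__ called']
print(result)  # [1, 2, 3]
```

The loaded value is the return value of the stored callable, not an instance of the original class.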

Let’s see what prints out:

➜  example python3.11 pickle_demo.py
Malicious code executed!

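Well, the unpickling was successful, and during the process it ran a shell command on our system through Python’s os module, which printed to our console with a normal echo command. Note that deserialized_object is not a MaliciousCode instance; it holds whatever os.system returned, which is the exit status of the command.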

What if the command wasn’t that innocent? It could be anything that can be executed on the system. With a few commands, or even a single one, an attacker could destroy our system for good.

To mitigate the security risks associated with unpickling, consider the following best practices:

Avoid Unpickling Untrusted Data: Do not unpickle data from untrusted or unauthenticated sources.
Use Safe Alternatives: If you need to exchange data between systems and security is a concern, consider using safer alternatives like JSON or XML for serialization.
Use Restricted Environments: If unpickling is necessary, consider doing it in a restricted environment (such as a sandbox) with limited privileges.
Implement Whitelists: If you control the pickling and unpickling process, implement whitelists (for example, by overriding pickle.Unpickler.find_class) to only allow certain classes or objects to be unpickled.
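A whitelist can be sketched by subclassing pickle.Unpickler and overriding find_class, which pickle calls whenever a pickle refers to a global such as a class or function. The names RestrictedUnpickler, restricted_loads, SAFE_BUILTINS and Evil are illustrative, and the allowed set is an assumption you would adapt to your own data.

```python
import builtins
import io
import os
import pickle

# Assumption: the only globals our own pickles ever need
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Allow a few harmless builtins and refuse everything else
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def restricted_loads(data):
    """Like pickle.loads(), but only whitelisted globals may be used."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

class Evil:
    def __reduce__(self):
        return (os.system, ("echo should not run",))

print(restricted_loads(pickle.dumps([1, 2, range(3)])))

try:
    restricted_loads(pickle.dumps(Evil()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

This narrows what a pickle can do, but it is only as safe as the whitelist itself, so it complements the other practices instead of replacing them.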

By following these best practices, you can reduce the risk of code execution through maliciously crafted pickled data.
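As an illustration of the "safe alternatives" advice, the student data used earlier in this article can be exchanged as JSON. This is a minimal sketch; it works here because the data consists only of dictionaries, strings and numbers, which JSON can represent directly.

```python
import json

student_data = {
    'name': 'Alice',
    'age': 22,
    'grades': {'math': 95, 'history': 87, 'english': 91}
}

# Serialize to a JSON string (text, not bytes)
encoded = json.dumps(student_data)

# Parsing JSON only builds dicts, lists, strings, numbers, booleans
# and None; it never calls functions named in the input
decoded = json.loads(encoded)

print(decoded == student_data)  # True
```

The trade-off is that JSON cannot carry arbitrary Python objects, so custom classes need an explicit conversion to and from plain data.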

Conclusion

In conclusion, delving into the world of Python’s pickle module unveils a powerful tool for object serialization. Its versatility streamlines data storage and exchange, enhancing the efficiency of Python applications. While pickle offers remarkable benefits, it's essential to be mindful of potential security risks, especially when handling untrusted data. By incorporating best practices like avoiding unpickling from untrusted sources and embracing alternative serialization formats when needed, developers can harness the full potential of pickle while maintaining a robust and secure coding environment. As we navigate the intricacies of pickling, let's leverage its strengths to build resilient and efficient Python applications.

That’s all folks!

