Customizing pickles

With most common Python objects, pickling just works. Basic primitives such as integers, floats, and strings can be pickled, as can container objects such as lists or dictionaries, provided the contents of those containers are also picklable. Further, and importantly, instances of most classes can be pickled, so long as all of their attributes are also picklable.
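As a minimal sketch of this behavior, the following round-trips a nested container and a simple object through pickle.dumps and pickle.loads. The Point class and the sample data are hypothetical, invented only for this illustration:

```python
import pickle


class Point:
    """Hypothetical class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Containers are picklable as long as everything inside them is.
data = {"numbers": [1, 2.5, 3], "name": "example", "nested": {"ok": True}}
restored_data = pickle.loads(pickle.dumps(data))
print(restored_data == data)  # True

# An instance whose attributes are all picklable needs no extra work.
point = pickle.loads(pickle.dumps(Point(1, 2)))
print(point.x, point.y)  # 1 2
```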

So, what makes an attribute unpicklable? Usually, the attribute holds time-sensitive state that it would not make sense to load in the future. For example, if we have an open network socket, open file, running thread, or database connection stored as an attribute on an object, it would not make sense to pickle these objects; a lot of operating system state would simply be gone when we attempted to reload them later. We can't just pretend a thread or socket connection exists and make it appear! Instead, we need to customize how such transient data is stored and restored.
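To see this kind of failure in isolation, here is a small sketch that tries to pickle a dictionary holding a threading.Lock, one example of an object tied to live process state. The dictionary contents are made up for the example, and the exact wording of the error message may differ between Python versions:

```python
import pickle
import threading

# A lock represents live process state, so pickle refuses to serialize it.
state = {"name": "worker", "lock": threading.Lock()}

try:
    pickle.dumps(state)
except TypeError as error:
    failed = True
    print("pickling failed:", error)
else:
    failed = False
```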

Here's a class that loads the contents of a web page every hour to ensure that they stay up to date. It uses the threading.Timer class to schedule the next update:

from threading import Timer 
import datetime 
from urllib.request import urlopen 
 
class UpdatedURL: 
    def __init__(self, url): 
        self.url = url 
        self.contents = '' 
        self.last_updated = None 
        self.update() 
 
    def update(self): 
        self.contents = urlopen(self.url).read() 
        self.last_updated = datetime.datetime.now() 
        self.schedule() 
 
    def schedule(self): 
        self.timer = Timer(3600, self.update) 
        self.timer.daemon = True
        self.timer.start()

url, contents, and last_updated are all picklable, but if we try to pickle an instance of this class, pickling fails on the self.timer instance:

>>> u = UpdatedURL("http://dusty.phillips.codes")
>>> import pickle
>>> pickle.dumps(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't pickle _thread.lock objects

That's not a very useful error, but it looks like we're trying to pickle something we shouldn't be. That would be the Timer instance; we're storing a reference to self.timer in the schedule method, and that attribute cannot be serialized.

When pickle tries to serialize an object, it simply tries to store the object's __dict__ attribute; __dict__ is a dictionary mapping all the attribute names on the object to their values. Luckily, before checking __dict__, pickle checks to see whether a __getstate__ method exists. If it does, it will store the return value of that method instead of the __dict__.
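The difference can be observed with two tiny classes, sketched below. Plain and Custom are hypothetical names for this illustration; Custom defines __getstate__ but no __setstate__, so the dictionary it returns is applied to the restored object's attributes:

```python
import pickle


class Plain:
    """Hypothetical class: pickle stores its whole __dict__."""

    def __init__(self):
        self.name = "plain"
        self.count = 3


class Custom:
    """Hypothetical class: pickle stores what __getstate__ returns."""

    def __init__(self):
        self.name = "custom"
        self.count = 3

    def __getstate__(self):
        # Only this dictionary is stored, not the full __dict__.
        return {"name": self.name}


plain = pickle.loads(pickle.dumps(Plain()))
custom = pickle.loads(pickle.dumps(Custom()))

print(plain.__dict__)   # {'name': 'plain', 'count': 3}
print(custom.__dict__)  # {'name': 'custom'}
```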

Let's add a __getstate__ method to our UpdatedURL class that simply returns a copy of the __dict__ without a timer:

    def __getstate__(self): 
        new_state = self.__dict__.copy() 
        if 'timer' in new_state: 
            del new_state['timer'] 
        return new_state

If we pickle the object now, it will no longer fail, and we can even successfully restore that object using pickle.loads. However, the restored object doesn't have a timer attribute, so it will not refresh the content as it is designed to do. We need to somehow create a new timer (to replace the missing one) when the object is unpickled.

As we might expect, there is a complementary __setstate__ method that can be implemented to customize unpickling. This method accepts a single argument, which is the object returned by __getstate__. If we implement both methods, __getstate__ is not required to return a dictionary, since __setstate__ will know what to do with whatever object __getstate__ chooses to return. In our case, we simply want to restore the __dict__, and then create a new timer:

    def __setstate__(self, data):
        self.__dict__ = data
        self.schedule()
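The same pairing works when the state is not a dictionary. The following is a network-free sketch under stated assumptions: Reminder is a hypothetical class (not part of the UpdatedURL example) whose __getstate__ returns a tuple and whose __setstate__ unpacks that tuple before starting a fresh timer:

```python
import pickle
from threading import Timer


class Reminder:
    """Hypothetical class: a message plus a non-picklable Timer."""

    def __init__(self, message, interval):
        self.message = message
        self.interval = interval
        self.schedule()

    def fire(self):
        print(self.message)

    def schedule(self):
        self.timer = Timer(self.interval, self.fire)
        self.timer.daemon = True
        self.timer.start()

    def __getstate__(self):
        # Return a tuple rather than a dict; the timer is left out.
        return (self.message, self.interval)

    def __setstate__(self, state):
        # Unpack the tuple produced by __getstate__, then rebuild the timer.
        self.message, self.interval = state
        self.schedule()


original = Reminder("stand up", 3600)
restored = pickle.loads(pickle.dumps(original))

print(restored.message, restored.interval)
print(restored.timer is not original.timer)

# Cancel both timers so the example exits promptly.
original.timer.cancel()
restored.timer.cancel()
```

Because __setstate__ is the only code that reads the pickled state, the two methods are free to agree on any picklable representation, as long as they stay in sync.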

The pickle module is very flexible and provides other tools to further customize the pickling process if you need them. However, these are beyond the scope of this book. The tools we've covered are sufficient for many basic pickling tasks. Objects to be pickled are normally relatively simple data objects; we likely would not pickle an entire running program or complicated design pattern, for example.