Python: Collections Library

July 10, 2017
8 min. read

This post is part of the Python Tips series.

One of the best ways to become efficient with Python is to really learn the standard libraries. However, do this with understanding that some are included for base functionality or compatibility, but have been superseded by better packages in PyPi. Requests over urllib2 is a great example of this. By being outside of Python, these packages can iterate faster and not be dependent on Python releases. Collections is not one that needed replacement.

The collections library provides data structures that are more specialized than the base list, dict, tuple, and set types. I’ll go through all the data types on collections and show why they are useful. I am not going to have a set of normal examples, because the Python documentation for collections is really good. I would suggest reading through those as you work through some examples and try things out on your own.

OrderedDict

OrderedDict is exactly what it sounds like, a dictionary that is ordered. A normal dictionary does not promise to maintain insertion order. Some recent optimizations to the dict that actually do maintain order. However, this is an implementation detail in CPython and not a language detail in Python. When you need to keep the order, use an OrderedDict.

I have a hardware system with 144 outputs of 2 types for 288 total bits of data. These are shifted out serially to the hardware boards, so I want a structure that maintains this order. However, I want to toggle the bits on and off easily. So my keys are a namedtuple with the number and type.

It would be possible to do this with the dictionary holding an index to the list and store values in a list. This is much simpler. I change values via dictionary interface and read data via the in-order iterator.

Counter

A Counter is a specialized dictionary. It holds keys and a count of how many times you have given the key. It was designed to handle positive integers and represent a running count of some type. However, nothing has been done to eliminate the use of negative values.

most_common() requires that the values be orderable, but with no other restrictions. This will return a sorted list (based on count) of key, value tuples. Providing a value to most_common(n) will return the top n values. By calling with no values and using slicing, it is possible to get least common pairs.

There are a few ways to initialize a Counter:

  • Empty: Counter() for an empty counter, ready to begin accumulating data.
  • Iterable: Counter('abcdcefgseda') or Counter([12, 32, 9, 7, 14, 19, 3]). This will accumulate the counts of the iterable as the start condition of counter.
  • dict or mapping: Counter({'blue': 4, 'yellow': 3}) used to load data previously counts in mapping form.
  • keyword arguments: Counter(blue=4, yellow=3) similar to mapping, but using the **kwargs to generate the mapping.

I have found the math operations useful at times, add + and subtract - will work with all counts for same key. Intersection & and Union | can be thought of as minimum and maximum for each count value.

elements() is also interesting, as it will recreate the sequence by duplicating the keys the number of times to give count.

namedtuple

Returning more than one value from a method is a common use of tuples. The problem with using tuples as a data structure is there is no documentation of the format. You can create an object for this, but that is more overhead (even using __slots__) than needed, if there is no functionality in the object.

The namedtuple is a an elegant way to solve this. It allows indexing by number and named field. It also makes a nice representation with fields, when printing out the object. It is common to see a recordset from a DB as a list of namedtuples, giving you the field name based access.

If you have tons of namedtuples around, the space for each is exactly the same as tuples (as verified with sys.getsizeof() below). However, there is an overhead for the structure and fields of the namedtuple definition (448 bytes below).

In other words, the only disadvantage is a single fixed memory cost, not related to number of instances. Well worth it in many instances for self documenting, and light-weight, data structures.

>>> from collections import namedtuple
>>> import sys
>>> Person = namedtuple('Person', 'name age weight')
>>> person1 = Person('John', 25, 220)
>>> person2 = ('John', 25, 220)
>>> sys.getsizeof(person1)
40
>>> sys.getsizeof(person2)
40
>>> sys.getsizeof(Person)
448
>>>

defaultdict

The defaultdict is a subclass of dict, which allows a factory method to be used when a key is not in the dict. The first argument is the factory method and all other arguments are what is valid for dict definition.

I have seen programmers use defaultdict in places where it was unneeded. The most common is when they just want a default value and not a default object. The easiest solution for this is using the get method of dictionary which allows the following: dict.get(key, default_value). This does not help when you want the default value to be a list and immediately append to it or just need the value to exist in a dict after your read.

In addition to the typical list, dict, set commonly used for the factory method, you can define your own. Also, simpler data types like int or bool work. I was curious what the default state was for the bool factory, so I tried it:

>>> from collections import defaultdict
>>> b = defaultdict(bool)
>>> b[0]
False
>>> b[3]
False
>>> b['joe']
False
>>> b
defaultdict(<class 'bool'>, {0: False, 'joe': False, 3: False})

>>>

So bool initializes as False. Notice how just reading values will create the keys. If we use the default value of get, the 3 reads are the same, but the resulting dict isn’t populated:

>>> c = {}
>>> c.get(0, False)
False
>>> c.get(3, False)
False
>>> c.get('joe', False)
False
>>> c
{}
>>>

Look at the Python docs for typical examples with lists, the normal use case for examples on defaultdict.

deque

The deque is short for double-ended queue and pronounced ‘deck’. This flexible data structure allows insertions and removals from either end. Initialization allows passing an iterable of some type and a max length. If a length is specified, the deque will become lossy once full.

In addition to the append you are familiar with in lists, we have appendleft. And extend also gets an extendleft pair for appending values from iterables. The pop method also has a popleft pair.

A FIFO (first in, first out) queue could be implemented by appending or extending values and poplefting them. A LIFO (last in, first out) stack would would be the same as using a list, as nothing is needed from the left.

When working with a limited length deque, the dropped values are not based on age but on which side you pushed the new values onto. In the session below, I define a maxlen of 3 and fill it up. Then push old values off each side. Notice we drop 4, even though it is the newest previous value.

>>> from collections import deque
>>> short = deque(maxlen=3)
>>> short
deque([], maxlen=3)
>>> short.append(1)
>>> short.append(2)
>>> short.append(3)
>>> short
deque([1, 2, 3], maxlen=3)
>>> short.append(4)
>>> short
deque([2, 3, 4], maxlen=3)
>>> short.appendleft(5)
>>> short
deque([5, 2, 3], maxlen=3)
>>>

ChainMap

ChainMap is useful when you need to combine multiple dictionaries and maps into one updateable view. If you don’t specify any, it will create a single empty dictionary. This is much faster than creating a new dictionary and pushing all of them into it.

When we mention that it is updateable, you have to understand that writes and deletions to the first mapping in the chain. This isn’t shown really well in documentation, so lets do a quick example.

Here we import ChainMap, and define two dicts with different and same keys. Notice how the ChainMap.__repr__ looks like a list of dicts. That is how they are stored inside.

>>> from collections import ChainMap
>>> a = {'name': 'John', 'age': 25}
>>> b = {'spouse': 'Jill', 'age': 23}
>>> c = ChainMap(a, b)
>>> c
ChainMap({'age': 25, 'name': 'John'}, {'age': 23, 'spouse': 'Jill'})

We access ‘age’ and get 25, showing that mapping will return the first found in order. But if ‘spouse’ isn’t in the first one, we get the second mapping value. We know the couple should be Jack and Jill, so we update ‘name’ to ‘John’. We see it in the __repr__.

>>> c['age']
25
>>> c['spouse']
'Jill'
>>> c['name'] = 'Jack'
>>> c
ChainMap({'age': 25, 'name': 'Jack'}, {'age': 23, 'spouse': 'Jill'})

What happens when we set values for keys that are in more than one or keys that are not in the first mapping? We change ‘age’ to 30 and see that the age is changed in the first mapping. If we look at the original a dict object, it is changed there as well.

>>> c['age'] = 30
>>> c
ChainMap({'age': 30, 'name': 'Jack'}, {'age': 23, 'spouse': 'Jill'})
>>> a
{'age': 30, 'name': 'Jack'}

We might expect changing ‘spouse’ to alter the original value in b. However, when we print out b, it is unchanged. We see in the ChainMap that we added a new ‘spouse’ key to the a dict.

>>> c['spouse'] = 'Jilly'
>>> b
{'age': 23, 'spouse': 'Jill'}
>>> c
ChainMap({'age': 30, 'spouse': 'Jilly', 'name': 'Jack'}, {'age': 23, 'spouse': 'Jill'})
>>>

In operation, think of a ChainMap as a single mutable dict reference and multiple fall back values. This is useful for settings with a local, user, system, default chain. And deviations from the standards are saved in local.

UserList, UserDict, UserString

These objects were created to allow custom dict, list, or string like objects. It used to not be possible to subclass these types directly. However, things can be simpler if you do not do this. Internal list or dict is stored in the data attribute.


Part 9 of 9 in the Python Tips series.

Series Start | Python: Dunder name

comments powered by Disqus