Inside Python's itertools.tee()

Itertools is a great library, but some methods definitely receive more attention than others – for instance, I’d wager that chain is a lot more well-known than tee. Python’s documentation describes tee as follows:

Return n independent iterators from a single iterable.

It also adds this caveat:

Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.

However, the documentation doesn’t really explain why you might want to use tee, instead of copying the iterable n times. The short answer is that tee() uses some heuristics and strategies to make the generation of these n iterators more memory efficient. This article will peek inside Python’s internals and explore these strategies in more detail.

Iterables, Iterators, and the Iterator Protocol

To understand the basis for the optimizations, it is essential to know how Python’s iterator protocol works. (More experienced Pythonistas might want to skim this section – it’s here for completeness.)

Iterables are Python objects that define an __iter__() method, which, when called, returns an iterator. Iterators define the __iter__() method as well, but implement it by simply returning the iterator itself. Iterators also implement a next() method, which returns the next element in a sequence. To signal the end of a sequence, next() raises a StopIteration exception. Any object that implements both __iter__ and next in the aforementioned manner is said to implement the iterator protocol.

To actually use these iterators and iterables, we turn to Python’s classic for item in sequence loop. Here, sequence can be either an iterator or an iterable – it doesn’t matter, because the loop will begin by implicitly invoking seq’s __iter__ method, which will return an iterator. Now, on each round through the loop, the Python interpreter will invoke the iterator’s next method and assign it to item, and will continue to do so until next raises StopIteration.

Perhaps you are wondering why Python makes the distinction between iterables and iterators, and why iterators don’t have prev or rewind methods. These two questions are in fact related. It can be expensive or complicated to implement these other methods – for instance, reading sequentially forwards from a file corresponds simply to fread() in C, but rewinding to the start is a more expensive fseek call, and there is no efficient way to read sequentially in reverse.

More importantly, prev and rewind are not essential for iteration – they can be built upon the primitive next method. For instance, if we really wish to access the earlier contents of the file in Python, we could buffer the earlier results ourselves. However, buffering all the earlier values is generally inefficient and wasteful: many programs have no need to access earlier values of an iterator, and those that do tend to limit themselves to one or two elements before the current one. Rewinding an iterable is a more common use case, but the same thing can be achieved by obtaining an entirely new iterator that points to the start of the sequence. To do that, we need a factory that produces iterators – which is exactly what an iterable is.

In sum, Python’s iterator protocol is simple to implement, and it works well for the common use case of for .. in loops.

Sometimes Buffering is Useful

The starkness of Python’s iterator design means that more complicated use cases will need to build their own abstractions on top of it. Fortunately, its simplicity also means that it is easy to build upon.

tee() is one such abstraction, built to efficiently create n independent iterators. Consider file I/O again: we might not want to create n file iterations by calling fopen() over and over, especially if n is large. Nor would we want to copy iterators that do heavy computation each time we called next(). In both cases buffering is a better solution, and this is what tee does, collating in one central list the values returned by the original iterator. Its return value consists of n tee iterator objects,1 each of which stores a single integer index that indicates the next element it should return from the list. The list itself is populated lazily – whenever any of the n iterators asks for a value at an index that is not yet in the buffer, tee() calls next on the underlying iterator, then caches and returns the new value.

With this implementation detail in mind, the reasoning behind the documentation’s caveat becomes clearer – using the iterator outside of tee() would cause some (or all) values to be lost from the cache.

Sometimes Buffering is Not Useful

Buffering is not always the most efficient way to create n independent iterators. For example, next could be a computationally cheap function, and it would be more efficient for us to just copy it n times. We can indicate this fact by defining a __copy__ method on our iterator.2 If tee() detects the presence of __copy__, it will copy the iterator instead of doing buffering. Moreover, it will only do this copying n - 1 times – the last iterator will simply be the original one that was passed in! Since the original iterator should never be used after being passed to tee(), this makes perfect sense; the end result will appear the same to the library consumer.

In fact, tee objects themselves implement __copy__! Since any tee object is backed by a central buffer, the most efficient way to duplicate one is to have the copy point to the same buffer. So not only does tee duplicate any iterator efficiently, it also ensures that future duplications are efficient.

Notably, as a consequence of this optimization, passing in a non-tee object that implements __copy__ will return a tuple of copies of that object, not tee objects. In practice, the type of the object should not matter as long as one sticks to using it solely as an iterator.

Efficient Buffering: The gritty details

If the total number of items generated by the iterator is very large, storing all of the generated values might take up a lot of memory. If we can keep track of all the indices of the tee objects generated by a tee() function call, we can delete old values once all of the indices have passed it. (Since iterators can only progress forwards, we know that these values will never be used again.) With CPython’s reference counting, this is actually quite simple: the buffer can be implemented as a singly linked list, with each tee object holding a pointer to the next element that it will return. As a natural consequence, once the last tee iterator is done with a buffer element, the element’s reference count goes to zero and it gets garbage collected.

Tee linked list buffer with 3 tee iterators using it. The grey cell has no pointers to it, so the reference counting mechanism will free it up.

However, linked lists perform worse than arrays in terms of memory fragmentation. Moreover, they incur lots of overhead in memory allocations and deallocations. As a compromise, CPython uses linked arrays – a linked list with arrays as individual elements. Moreover, these arrays are sized such that the entire array element is 64 bits in size, fitting nicely into cache lines.

A Python Implementation

Python’s documentation gives pure-Python implementation of tee(), but it simplifies things and leaves out the buffering and copy semantics. I’ve written up a slightly more involved one that reflects the optimizations described above. Unlike the version in the docs, this implementation passes the standard library’s test suite. It still makes some concessions to simplicity: for instance, buffering uses a single list, instead of the linked arrays described above.

    class TeeIterator(object):

        def __new__(cls, tee_data):
            if isinstance(tee_data, TeeData):
                self = super(TeeIterator, cls).__new__(cls)
                self.tee_data = tee_data
            else:
                self = TeeIterator.from_iterable(tee_data)
            self.index = 0
            return self

        def __copy__(self):
            return TeeIterator(self.tee_data)

        def __iter__(self):
            return self

        def next(self):
            rv = self.tee_data[self.index]
            self.index += 1
            return rv

        @classmethod
        def from_iterable(cls, iterable):
            if isinstance(iterable, cls):
                return iterable.__copy__()
            tee_data = TeeData(iter(iterable))
            return TeeIterator(tee_data)

    class TeeData(object):

        def __init__(self, iterator):
            self.iterator = iterator
            self.buffer = []

        def __getitem__(self, index):
            if index == len(self.buffer):
                self.buffer.append(next(self.iterator))
            return self.buffer[index]

    def tee(iterable, n=2):

        if n < 0:
            raise ValueError("n must be >= 0")
        elif n == 0:
            return ()

        if hasattr(iterable, '__copy__'):
            copyable = iterable
        else:
            copyable = TeeIterator.from_iterable(iterable)

        result = [ copyable ]

        for i in xrange(1, n):
            result.append(copyable.__copy__())

        return tuple(result)

  1. Yes, the returned objects are called tee as well, but they are completely distinct from the itertools.tee() function.

  2. copy.copy() looks for this method too.

Related Posts

Jez Ng 28 June 2012
blog comments powered by Disqus