Bencoding (encoding of a .torrent file)

The Bencode format is an interesting design. It is byte based, which makes it safe from big-endian and little-endian translations. Somewhere when reading about how torrents worked, I got looking at their file format. As far as I know, the Bencode format isn’t used on anything but torrent files. The format is pretty simple, with only 4 different data structures: Byte String, Integer, List, and Dictionary.

Bencode Basics

Byte String

This is formatted as [integer length]:[byte string]. Simply read the length number of bytes following the colon.

Integer

The integer type is structured as i[contents]e. Contents can be positive or negative integer, without leading zeros.

Valid examples: i-1234e i1234e

List

The list type (or array if you like) is formatted as l[contents]e. Contents will be any valid Bencode types just appended together.

Dictionary

The dictionary type (also know as a hash or an associative array) is formatted as d[contents]e. Contents includes [key][value] pairs until termination. The key value must always be a byte string, but the value may be any Bencode type (even another dictionary).

The Code

It is possible to store complex structures with these simple types. I enjoyed learning the format by creating a Python class for encoding and decoding data in Bencode format. The source code for this class is available on my github.

I have been using Python for quite a while as a scripting language, but only recently started trying to create proper classes and develop more advanced OO designed applications with Python. I’m sure this isn’t purely Pythonic and would welcome any criticism and ways to improve. I wanted to call out a few things in the code that were good learning experiences for me.

Encoding Methods

Functions are objects in Python and can be stored and referenced just like data. In the internal encoding function _local_encode, I’m storing the various data type encode methods in a dictonary with data type string as key. This allows me to get data type string with type(data).__name__. If the key exists in the dictonary, I envoke that method to process the data.

def _local_encode(self, data):
    """ Returns the proper Bencoded string for data given """
    encoders = {'dict': self._encode_dictionary,
                'list': self._encode_list,
                'tuple': self._encode_list,
                'str': self._encode_string,
                'int': self._encode_integer,
                'float': self._encode_float}
    data_type = type(data).__name__
    if data_type in encoders.keys():
        return encoders[data_type](data)
    else:
        raise Exception('Encoder is not defined for data type: %s' % data_type)

String encoding is very simple, just joining length of the string with the data.

@staticmethod
def _encode_string(data):
    """ Converts string to Bencoded Byte String """
    return '%d:%s' % (len(data), data)

Encoding integer is just wrapping ‘i’ and ‘e’ around the string conversion of the integer. Since bencode does not encode floats, those are converted to ints.

@staticmethod
def _encode_integer(data):
    """ Converts Integer to Bencoded Integer string """
    return 'i%de' % data

def _encode_float(self, data):
    """ Converts Float to Bencoded Integer string """
    return self._encode_integer(int(data))

List encoding performs recursive calls into _local_encode as it loops through the list, just appending items. I optimised with lists instead of string concatenation. Originally, I built this with the ben variable as a string and += calls, which is terrible for memory performance.

def _encode_list(self, data):
    """ Converts List or Tuple to Bencoded List string """
    ben = ['l']
    for item in data:
        ben.append(self._local_encode(item))
    ben.append('e')
    return ''.join(ben)

Dictionary works the same way as list, just appending the generated values from _local_encode with the addition of a string in front. Again, this was changed from string concatenation to list append.

def _encode_dictionary(self, data):
    """ Converts Dictionary to Bencoded Dictionary String """
    ben = ['d']
    for key in sorted(data.keys()):
        ben.append(self._encode_string(key))
        ben.append(self._local_encode(data[key]))
    ben.append('e')
    return''.join(ben)

This came out simpler and cleaner than I expected for encoding. But it is a really simple format, so it shouldn’t be too complex.

Decoding Methods

For my external decode method, I’m using a class level pointer for the bencoded string and walking through it. I setup those two variables and call into _local_decode.

def decode(self, ben_string):
    """ Returns the data structure defined by the Bencoded string """
    try:
        self._ben_string = ben_string
        self._pointer = 0
        return self._local_decode()
    except IndexError as e:
        print("Index Error at Pointer = %d of %d\n%s" % (self._pointer, len(self._ben_string), str(e)))

While I thought I might use the dictonary mapped methods again, I couldn’t find a clean way to handle the 11 possible cases for integer. So, I decided to just just go old school flow control. This is one situation that I would have used a switch statement with an else case if it was available in Python. If you have a more Pythonic way of doing this, please ley me know!

def _local_decode(self):
    """ Detects and returns the proper Bencoded data type starting at current Pointer position """
    cur_char = self._ben_string[self._pointer]
    if cur_char == 'i':
        return self._parse_integer()
    elif cur_char == 'l':
        return self._parse_list()
    elif cur_char == 'd':
        return self._parse_dictionary()
    elif cur_char in ('-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'):
        return self._parse_byte_string()
    else:
        raise Exception('Invalid character detected at position %d : %s' % (self._pointer, str(cur_char)))

Before we get into Integer and String decoding, we have a helper function. Both the Integer and the String will have a certain number of characters to parse into an integer. The only difference is the stop character. For the String, this is the length of the string with ‘:’ end character. For the Integer, this is the actual data value with ‘e’ end character. You will see this method called from both decode methods. Note that the int() parsing handles the leading ‘-‘ fine for negative integers.

def _get_number(self, end_char):
    """
    Returns number in string up to termination character given.
    """

    start_pointer = self._pointer
    while self._ben_string[self._pointer] != end_char:
        self._pointer += 1
    num = int(self._ben_string[start_pointer:self._pointer])  # Parse int
    self._pointer += 1  # Consume the end_char
    return num

The only job of the integer decode method is to consume the leading character that doesn’t exist in the string integer. Then all the work is turned over to the _get_number method above up to the ‘e’ character.

def _parse_integer(self):
    """
    Parses a Bencoded Integer type of format i<number>e
    """
    self._pointer += 1  # Consume the 'i' start character
    return self._get_number('e')

The String decoding gets the length integer this the helper function up to the ‘:’ character. Then it gets the string for return, and increments the pointer past the string.

def _parse_byte_string(self):
    """
    Parses a Bencoded Byte String type of format <len>:<byte string>
    """

    str_len = self._get_number(':')
    byte_string = self._ben_string[self._pointer:self._pointer + str_len]
    self._pointer += str_len  # Consume the byte string
    return byte_string

The list starts with ‘l’ and ends with ‘e’. So we move the pointer to consume the ‘l’, then we loop until the current pointer character is an ‘e’. Each time through the while, the class level pointer might increment quite a bit, as data items are consumed. However, we are not at the end of the list until we see a letter ‘e’.

For each list item, much like the recursive encoding, we call _local_decode to return us whatever the data is.

Since it is the responsibility of all the decoding methods to consume their bytes, we increment the pointer to move past the list termination ‘e’ at the end.

def _parse_list(self):
    """
    Parses a Bencoded List type of format l<contents>e

    All members of the list as simply appended together in Bencode format to make up <contents>
    """
    self._pointer += 1  # Consume the 'l' start character
    val = []

    while self._ben_string[self._pointer] != 'e':
        val.append(self._local_decode())

    self._pointer += 1 # Consume the 'e' termination character
    return val

This is very similar to list processing, with the addition of the key. We know that the key is required to be a byte string, so we call directly to that method.

def _parse_dictionary(self):
    """    Parses a Bencoded Dictionary type of format d<pairs>e

    All pairs consist of ByteString key followed by any valid data type
    """
    self._pointer += 1  # Consume the 'd' start character
    be_dict = {}
    while self._ben_string[self._pointer] != 'e':
        key = self._parse_byte_string()   # Skipping local decode, because key must be a byte string
        val = self._local_decode()
        be_dict[key] = val
    self._pointer += 1  # Consume the 'e' termination character
    return be_dict

If you have a cleaner implementation you would like to share, please contact me. I found the few iterations on this to be enjoyable and a good learning experience.

The code is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License


Comments