Python: Strings

April 15, 2017

This post is part of the Python Tips series.

Python 3 Strings

The handling of strings was one of the major breaking changes between Python 2 and 3. Python 2 was easy for English-speaking folks who could live in 8-bit ASCII land. It was easy to use a string as a sequence of bytes when doing binary transfer. However, turning those 8-bit strings into Unicode was a little daunting.

With Python 3, all strings are Unicode. To get a bytes object, which is what a plain string was in Python 2, prefix the literal with b, as in b'bytes object with \xf4 <- high 8-bit character'.
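As a rough illustration of the difference between the two types, here is a minimal sketch. The variable names and sample text are only for illustration.

```python
# In Python 3 a plain literal is a Unicode str; a b prefix gives bytes.
text = 'caf\u00e9'        # str: 4 Unicode characters
raw = b'caf\xc3\xa9'      # bytes: the same text as 5 UTF-8 byte values

print(type(text).__name__, len(text))   # str 4
print(type(raw).__name__, len(raw))     # bytes 5

# Indexing a str gives a one-character str; indexing bytes gives an int.
first_char = text[0]
first_byte = raw[0]
print(first_char, first_byte)           # c 99

# A str and a bytes object never compare equal, even if they look alike.
same = ('abc' == b'abc')
print(same)                             # False
```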

If you haven’t seen a hexadecimal byte representation before, such as \xf4, let me cover that quickly. One byte can hold 256 values, from 0 to 255. This corresponds to hexadecimal 0x00 to 0xff. The representation in a bytes string is \x00 to \xff. Python prints bytes in this hex format when they are not printable ASCII characters. You can also enter a bytes object the hard way. Defining b'\x62\x79\x74\x65\x73' is the same thing as b'bytes'.
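A quick check of the hex notation described above, as a small sketch you can paste into an interpreter:

```python
# Hex escapes and plain ASCII characters describe the same byte values.
spelled_out = b'\x62\x79\x74\x65\x73'
plain = b'bytes'
print(spelled_out == plain)   # True
print(spelled_out)            # b'bytes'

# Bytes that are not printable ASCII are shown as \x escapes.
mixed = bytes([0x62, 0xf4, 0x00])
print(mixed)                  # b'b\xf4\x00'
print(list(mixed))            # [98, 244, 0]
```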

When Unicode strings are transmitted or stored in files as binary, we need to send them as bytes. So we need to encode strings to bytes and decode bytes to strings. This is done with the '.encode()' method on strings and the '.decode()' method on bytes. Both take the name of the encoding as an argument, such as 'utf-8'. Either method raises an error (UnicodeEncodeError or UnicodeDecodeError) if the data contains values that are invalid for that encoding.
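The round trip and the two error cases might look like the sketch below. The sample strings are made up for illustration, and the errors='replace' argument is one of several standard error handlers, shown here only as an option.

```python
text = 'na\u00efve caf\u00e9'

# str -> bytes -> str round trip using UTF-8
encoded = text.encode('utf-8')
print(encoded)                      # b'na\xc3\xafve caf\xc3\xa9'
decoded = encoded.decode('utf-8')
print(decoded == text)              # True

# Encoding fails when the target encoding cannot represent a character.
encode_failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    encode_failed = True

# Decoding fails when the bytes are not valid for the encoding.
decode_failed = False
try:
    b'\xf4'.decode('utf-8')
except UnicodeDecodeError:
    decode_failed = True

print(encode_failed, decode_failed)  # True True

# An error handler can substitute a placeholder instead of raising.
replaced = b'caf\xf4'.decode('utf-8', errors='replace')
print(replaced == 'caf\ufffd')       # True
```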

Normalize bytes into Unicode strings by decoding as soon as possible after receiving them. Do not mix encodings. Things get ugly. UTF-8 will handle most things. Try to stay there.
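One way to picture "decode early" is a function that converts to str at the edge, works on text in the middle, and encodes only on the way out. This is a sketch under assumptions: handle_message is a hypothetical helper, and the input is assumed to be UTF-8.

```python
def handle_message(raw_bytes):
    """Hypothetical handler: decode on the way in, encode on the way out."""
    text = raw_bytes.decode('utf-8')   # bytes -> str at the boundary
    reply = text.strip().upper()       # all real work happens on str
    return reply.encode('utf-8')       # str -> bytes only when sending


incoming = '  caf\u00e9 au lait \n'.encode('utf-8')
outgoing = handle_message(incoming)
print(outgoing.decode('utf-8'))        # CAFÉ AU LAIT
```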

Methods

Some useful methods on strings:

s.upper() and s.lower() - return uppercase and lowercase versions of the string

s.strip() - removes whitespace from the beginning and end

s.isalpha() s.isdigit() - return True if the string is non-empty and all characters are alphabetic or digits, respectively

s.startswith('begin string') s.endswith('end string') - return True if the string starts or ends with the given string

s.replace('find', 'replace') - returns a copy with all occurrences of 'find' replaced by 'replace'

s.split('delimiter') - splits the string into a list of strings at each occurrence of the delimiter

s.join(list) - joins a list of strings into one string with s in between each. More commonly used as ''.join(list)
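The methods above can be tried out with a few throwaway values. The sample strings here are arbitrary.

```python
s = '  Hello, World  '
trimmed = s.strip()
print(repr(trimmed))                   # 'Hello, World'
print(trimmed.upper())                 # HELLO, WORLD
print(trimmed.lower())                 # hello, world

print('abc'.isalpha(), '2017'.isdigit(), 'abc123'.isalpha())  # True True False
print(trimmed.startswith('Hello'), trimmed.endswith('World')) # True True

replaced = trimmed.replace('l', 'L')
print(replaced)                        # HeLLo, WorLd

parts = 'a,b,c'.split(',')
print(parts)                           # ['a', 'b', 'c']

# The string you call join on is the separator placed between items.
dashed = '-'.join(parts)
glued = ''.join(parts)
print(dashed, glued)                   # a-b-c abc
```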

Concatenating Strings

Strings in Python are immutable. This doesn’t mean that you can’t change the value of a string variable. When you do, Python creates a new string and then points the variable to the new string. This costs processing time, and the old string takes up memory until the garbage collector cleans it up.
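A small sketch of what immutability means in practice: changing a character in place fails, while reassigning the variable simply binds it to a new object.

```python
s = 'spam'

# In-place modification is not allowed on a str.
assignment_failed = False
try:
    s[0] = 'S'
except TypeError as err:
    assignment_failed = True
    print(err)            # 'str' object does not support item assignment

# "Changing" the variable builds a new string and rebinds the name.
original = s
s = s + ' and eggs'
print(original)           # spam
print(s)                  # spam and eggs
print(s is original)      # False
```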

Strings can be joined with the + operator. The problem is that this creates and throws away intermediate strings at every step. If you are building up a string from many parts, there is a more efficient and pythonic method: collect the parts in a list and join them at the end. The list keeps all the small string objects until you create the large one in a single step, after which they can be thrown away.

How do we know this is better? Python has a timeit module that allows you to run a method or code and test execution time. I tested naive appending with +, list build then join, and using a file-like object to hold data. Below is my test code and results:

import io
import timeit

number_max = 100000

# build a list of strings that are unique
to_append = [str(i) for i in range(number_max)]


def naive_method():
    """ Test using + appending """
    build_str = ''
    for val in to_append:
        build_str += val
    return build_str


def list_join_method():
    """ Test using list build and join """
    build_list = []
    for val in to_append:
        build_list.append(val)
    return ''.join(build_list)


def pseudo_file_method():
    """ Test with pseudo file """
    with io.StringIO() as fake_file:
        # write the same strings the other two methods use
        for val in to_append:
            fake_file.write(val)
        return fake_file.getvalue()


# timeit will run each function this number of times

run_count = 200

print('naive_method: {}'.format(timeit.timeit(naive_method, number=run_count)))
# naive_method: 3.3670036327086508


print('list_join_method: {}'.format(timeit.timeit(list_join_method, number=run_count)))
# list_join_method: 2.0951417760739637


print('pseudo_file_method: {}'.format(timeit.timeit(pseudo_file_method, number=run_count)))
# pseudo_file_method: 2.514028840406624

The list join method took 2.1 seconds, the pseudo file 2.5 seconds, and the naive '+' method 3.4 seconds, each for 200 runs.

If in doubt about which way of doing things is best, it is good to quickly test. However, performance isn’t always the most important thing, especially if chasing it hampers readability and understanding of the code. Otherwise, you would just write in a language where performance comes first. Python is just as much about programmer performance as code performance.

The language has been developed so that typical pythonic methods are the fastest, or close enough to not matter. Just like following PEP 8, writing pythonic code is good practice, as it creates a common grammar that allows programmers to more easily parse each other’s code.


Part 2 of 9 in the Python Tips series.

