Difference between bytes and strings

When working with data inputs in Python — processing text, doing statistical analysis — we are working with strings.

In [7]:

When reading files from disk into Python we decode binary data into strings, and when saving text to disk we encode strings to binary.

The str.encode() method encodes a string into bytes before it is saved to disk:

In [8]:
'café'.encode('utf8')
b'caf\xc3\xa9'

The small b at the start of the result above indicates that it is a byte string (a bytes object).

In [12]:

A byte string is simply a sequence of integers in the range 0-255:

In [14]:
list('café'.encode('utf8'))
[99, 97, 102, 195, 169]

Note that the string and its byte representation differ in length, because é occupies two bytes in UTF-8:

In [16]:
len('café'), len('café'.encode('utf8'))
(4, 5)
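To see where the extra byte comes from, each character can be encoded on its own. This is a minimal sketch; the name `widths` is introduced here only for illustration.

```python
# Number of UTF-8 bytes needed for each character of the string.
widths = {ch: len(ch.encode("utf-8")) for ch in "café"}

for ch, n in widths.items():
    # Show the character, its byte count, and the byte values themselves.
    print(ch, n, list(ch.encode("utf-8")))
```

Only é needs two bytes (195 and 169); the ASCII letters take one byte each.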

str.encode() uses UTF-8 by default, unless another encoding is specified explicitly:

In [11]:
'café'.encode('cp1252')  # ASCII characters in binary are displayed 'as-is'
b'caf\xe9'

The default encoding when reading files is taken from the OS locale settings; it is usually UTF-8 on Unix-like systems and cp1252 on Windows systems.
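The defaults on a given machine can be inspected directly. This is a small sketch using only the standard library; the variable names `open_default` and `codec_default` are made up for this example, and the first value depends on the machine it runs on.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding argument is given.
# It is locale-dependent, so the value differs between machines.
open_default = locale.getpreferredencoding(False)

# Encoding used by str.encode() and bytes.decode() when none is given.
codec_default = sys.getdefaultencoding()

print(open_default)
print(codec_default)  # utf-8 on Python 3
```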

If a file was created on Linux but is opened on Windows with the default cp1252 (that is, with the encoding left unspecified on the Windows machine), we can get unexpected results:

In [5]:
with open('cafe.txt', 'w', encoding='utf_8') as f:   # written on Linux
    f.write('café')

with open('cafe.txt', 'r', encoding='cp1252') as f:  # read on Windows (with no encoding actually specified)
    print(f.read())  # cafÃ©
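The same mismatch can be reproduced in memory, without touching the file system. The sketch below also shows that, in this particular case, the damage is reversible by undoing the wrong decoding step; that is not guaranteed in general, since not every byte value is defined in cp1252. The variable names are illustrative only.

```python
original = "café"

raw = original.encode("utf-8")   # the bytes as written on Linux
garbled = raw.decode("cp1252")   # the same bytes misread as cp1252
print(garbled)                   # cafÃ©

# Undo the wrong decoding, then decode with the encoding actually used.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)                  # café
```

The two bytes of é (0xC3 and 0xA9) are each a valid cp1252 character, which is why the misread produces two wrong characters instead of raising an error.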

This small example shows why it matters to get the encoding right when reading files.

How to get encoding right?

Short answer: guess it right!

On Linux, and in Python, there is a utility that makes an educated guess at a file's encoding by applying heuristics to the binary content of the file to be opened:

In [6]:
!chardet cafe.txt
cafe.txt: utf-8 (confidence: 0.51)

Depending on the installation, the utility may be named chardet, chardet3, chardetect, or chardetect3.
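When no detection utility is at hand, a much cruder guess is possible with the standard library alone: try a short list of candidate encodings and keep the first one that decodes without error. This is not what chardet does (chardet uses statistical heuristics); it is a simplified stand-in, and the function name `guess_encoding` and the candidate list are assumptions made for this sketch.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw without error.

    Strict encodings such as UTF-8 should come first, because permissive
    single-byte encodings such as cp1252 accept almost any input.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None  # none of the candidates fit


utf8_bytes = "café".encode("utf-8")
cp1252_bytes = "café".encode("cp1252")

print(guess_encoding(utf8_bytes))    # utf-8
print(guess_encoding(cp1252_bytes))  # cp1252
```

A successful decode only means the bytes are valid in that encoding, not that the text is what the author wrote, so the result is still a guess.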

© 2014 In R we trust.