When working with data inputs in Python — processing text, doing statistical analysis — we are working with strings.
When reading files from disk into Python, we decode binary data into strings, and when saving text to disk, we encode strings to binary.
The str.encode() method encodes a string into binary format before it is saved to disk:
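A minimal sketch of such a call, assuming the sample string is 'café' (the string used in the later examples); the variable names are illustrative only:

```python
# Encode a string to bytes; str.encode() uses UTF-8 when no encoding is given
text = 'café'
encoded = text.encode()
print(encoded)  # b'caf\xc3\xa9'
```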
The b at the start of the above expression indicates that this is a byte string.
A byte string is simply a sequence of integers in the range 0-255:
[99, 97, 102, 195, 169]
Note that the string and its byte representation differ in length, because
é occupies two bytes in the UTF-8 byte representation:
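The length difference can be checked directly. A small sketch, again assuming the sample string 'café' with é as the single code point U+00E9:

```python
text = 'café'
encoded = text.encode('utf-8')

print(len(text))            # 4 characters
print(len(encoded))         # 5 bytes, because 'é' takes two bytes in UTF-8
print('é'.encode('utf-8'))  # b'\xc3\xa9'
```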
The default encoding for str.encode() is UTF-8, unless another encoding is specified explicitly:
'café'.encode('cp1252')  # gives b'caf\xe9'; ASCII characters in the bytes repr are displayed as-is
The default encoding when reading files is taken from the OS settings and is usually
UTF-8 on *nix systems and
cp1252 on Windows systems.
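To see which default applies on a given machine, the standard library can report it. A minimal sketch; the result depends on the OS and locale, so no expected output is shown:

```python
import locale

# Encoding that open() falls back to when no encoding argument is given
# (the value depends on the platform and locale settings)
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Passing encoding explicitly to open() avoids depending on this value.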
If a file was created on Linux but is opened on Windows with the default
cp1252 (that is, with the encoding left unspecified on the Windows machine), we can get unexpected results:
open('cafe.txt', 'w', encoding='utf_8').write('café')  # written on Linux
open('cafe.txt', 'r', encoding='cp1252').read()  # read on Windows (with no encoding actually specified)
# 'cafÃ©'
This small example shows why it matters to get the encoding right when reading files.
Short answer: guess it right!
On Linux, and in Python, there is a utility that makes an educated guess at a file's encoding by applying heuristics to the binary content of the file to be opened:
cafe.txt: utf-8 (confidence: 0.51)
Other possible names of the utility are: