Unicode in Python
Before Unicode
- Memory consists of bytes
- A string is a sequence of bytes
- One byte can take one of 256 values
- So one byte can stand for one of 256 symbols
- To print a byte, you look up the symbol it stands for...
- ... in a special table called a code page
ASCII code page
- A code page is a table that maps bytes to symbols
- The ASCII table is the standard
- 0..127 for Latin and system (control) symbols
- 128..255 for extended symbols
What about foreign languages?
- 0..127 for Latin and system symbols anyway
- 128..255 for region-specific symbols
- One region, one code page
- cp866: DOS Cyrillic
- cp1251: Windows Cyrillic
- cp1253: Windows Greek
- etc.
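A minimal sketch of the lookup described above, written in Python 3 syntax (an assumption of this example: there `bytes` is the 8-bit string and `str` is the Unicode string). The byte 0xCF is picked only for illustration; the same value is decoded with each of the three code pages.

```python
# The same byte looked up in three different code pages.
raw = b'\xcf'
code_pages = ('cp866', 'cp1251', 'cp1253')

symbols = {cp: raw.decode(cp) for cp in code_pages}
for cp in code_pages:
    # Print the code point rather than the glyph, so the output
    # does not depend on the terminal's own code page.
    print('%s: byte 0xCF -> U+%04X' % (cp, ord(symbols[cp])))

# Bytes 0..127 decode to the same symbol in every one of these tables.
ascii_symbols = {cp: b'A'.decode(cp) for cp in code_pages}
print(ascii_symbols)
```

The three lookups give three different symbols, while the ASCII byte gives the same letter each time.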
Encoding hell
- Lots of code pages
- Every message should declare the code page
- Encode/decode errors, strange symbols instead of text
- Remember your inbox 10 years ago
How it fails
- User A types "Привет" on Linux (UTF-8)
- The bytes are \xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82
- The program should send a header with the encoding, but it doesn't
- User B receives the message without a header on Windows (cp1251)
- So the default code page, cp1251, is used
- The decoded message is РџСЂРёРІРµС‚
- :--(
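The failure can be reproduced in a few lines. This is a sketch in Python 3 syntax (assumed here; `str` is Unicode and `bytes` is the 8-bit string), and the variable names are made up for illustration.

```python
# "Привет", written with escapes so the source file encoding does not matter.
text = '\u041f\u0440\u0438\u0432\u0435\u0442'

sent = text.encode('utf-8')        # bytes produced on user A's side
garbled = sent.decode('cp1251')    # user B guesses the wrong code page
restored = sent.decode('utf-8')    # user B is told the right encoding

print(ascii(sent))
print(len(text), len(garbled))     # 6 symbols turn into 12 wrong ones
print(restored == text)
```

No bytes are lost in transit; only the choice of table on the receiving side differs.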
The solution is Unicode
- It's one common alphabet for all languages
- Every symbol has its own number (code point)
- Unicode is growing
- U+XXXX refers to a specific code point
U+0031 1
U+00A9 ©
U+20AC €
U+266B ♫
U+4E64 乤
U+45B4 䖴
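A short sketch of how these numbers can be inspected, assuming Python 3 and its standard `unicodedata` module. `code_point_label` is a hypothetical helper written for this example.

```python
import unicodedata


def code_point_label(symbol):
    """Format a single symbol as U+XXXX (hypothetical helper)."""
    return 'U+%04X' % ord(symbol)


for symbol in ('1', '\u00a9', '\u20ac', '\u266b'):
    # unicodedata.name() returns the official name of the symbol.
    print(code_point_label(symbol), unicodedata.name(symbol))

# chr() goes the other way: from the number back to the symbol.
euro = chr(0x20AC)
```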
What is UTF?
- An algorithm for turning Unicode code points into bytes and reading them back
- There are UTF-8, UTF-16 and UTF-32 encodings
- UTF-8 uses 1 byte for ASCII symbols and 2 to 4 bytes for everything else
- UTF-8 is compatible with ASCII (plain English) text
- Can encode every Unicode code point
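To compare the encodings, the sketch below measures how many bytes one symbol takes in each of them. It assumes Python 3; `encoded_sizes` and the sample symbols are chosen for this example only.

```python
def encoded_sizes(symbol):
    """Return the byte length of one symbol in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte order mark is added.
    encodings = ('utf-8', 'utf-16-le', 'utf-32-le')
    return tuple(len(symbol.encode(enc)) for enc in encodings)


samples = [
    ('Latin A', 'A'),
    ('Cyrillic Pe', '\u041f'),
    ('Euro sign', '\u20ac'),
    ('Musical G clef', '\U0001d11e'),
]
for label, symbol in samples:
    print(label, encoded_sizes(symbol))

# ASCII text gives identical bytes in ASCII and in UTF-8.
same_bytes = 'Hello'.encode('utf-8') == 'Hello'.encode('ascii')
```

The sizes are computed rather than hard-coded, so the loop can be rerun with any other symbol.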
Still have problems with Unicode?
- Try to use UTF-8 encoding everywhere
- Read Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Unicode in Python 2.x
- Unicode strings are declared with the u'...' prefix
- There should be a coding directive in your file
- Use \uXXXX to refer to Unicode symbols 0..FFFF
- Use \UXXXXXXXX to refer to symbols 0..10FFFF (rare)
# coding=utf-8
# (equivalent form: # -*- coding: utf-8 -*-)
name = u'Иван'
name = u'Ιαννης'
name = u'João'
greet = u'\u041f\u0440\u0438\u0432\u0435\u0442'
print greet
>>> Привет
print u'\U0000041f\U00000440\U00000438\U00000432\U00000435\U00000442'
>>> Привет
data = u"""
Any string in english
Любая строка на русском
任何字符串在中國"""
print data
Any string in english
Любая строка на русском
任何字符串在中國
repr(data)
u'\nAny string in english\n\u041b\u044e\u0431\u0430\u044f \u0441\u0442\u0440\u043e\u043a\u0430 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c\n\u4efb\u4f55\u5b57\u7b26\u4e32\u5728\u4e2d\u570b'
How to turn an object into Unicode
unicode([1, 2, True, {"test": 42}])
>>> u"[1, 2, True, {'test': 42}]"
unicode(obj) == obj.__unicode__()
class User(object):
    def __unicode__(self):
        return u'%s %s' % (self.name, self.surname)
user = User(...)
unicode(user) # u'Михаил Паниковский'
How to turn an 8-bit string into Unicode
message = '\xcf\xf0\xe8\xe2\xe5\xf2'
decoded = message.decode('cp1251')
repr(decoded)
>>> u'\u041f\u0440\u0438\u0432\u0435\u0442'
print decoded
Привет
message = '\x8f\xe0\xa8\xa2\xa5\xe2'
decoded = unicode(message, 'cp866')
print decoded
>>> Привет
How to turn Unicode into an 8-bit string
udata = u'Сообщение'
udata.encode('cp1251')
>>> '\xd1\xee\xee\xe1\xf9\xe5\xed\xe8\xe5'
udata.encode('utf-8')
>>> '\xd0\xa1\xd0\xbe\xd0\xbe\xd0\xb1\xd1\x89\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5'
Encode/decode chaining
u'Иван'.encode('cp866').decode('cp866')
>>> u'\u0418\u0432\u0430\u043d'
'\x88\xa2\xa0\xad'.decode('cp866').encode('utf-8')
>>> '\xd0\x98\xd0\xb2\xd0\xb0\xd0\xbd'
Encode/Decode errors
>>> u'Ivan'.encode('cp1251')
'Ivan'
>>> u'Иван'.encode('cp1251')
'\xc8\xe2\xe0\xed'
>>> u'任何字符串在中國'.decode('cp1251')  # calling decode() on a unicode object makes Python 2 encode it to ASCII first
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)
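Such errors can be caught, or softened with the standard `errors` argument of `encode()` and `decode()`. The sketch below assumes Python 3; the sample data and variable names are illustrative only.

```python
text = '\u4efb\u4f55'   # two CJK symbols that cp1251 has no bytes for

try:
    text.encode('cp1251')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

replaced = text.encode('cp1251', errors='replace')   # one "?" per symbol
ignored = text.encode('cp1251', errors='ignore')     # symbols are dropped

broken = b'\xd0'        # first half of a two-byte UTF-8 sequence
try:
    broken.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

patched = broken.decode('utf-8', errors='replace')   # U+FFFD placeholder
```

Both `'replace'` and `'ignore'` lose information, so they suit logging or display better than storage.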
The meaning of strings
- A Unicode string is text for humans
- An 8-bit string is just bytes
- Avoid printing bytes
- The result depends on locale, code page, etc.
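A small sketch of the difference, assuming Python 3 (`str` for text, `bytes` for 8-bit data): the same word has one length as text and different byte representations depending on the encoding.

```python
text = '\u041f\u0440\u0438\u0432\u0435\u0442'   # "Привет"

as_utf8 = text.encode('utf-8')
as_cp1251 = text.encode('cp1251')

# One text, two different byte strings of different length.
print(len(text), len(as_utf8), len(as_cp1251))
print(as_utf8 == as_cp1251)
```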
Don't mix unicode and 8-bit strings!
u'Привет, ' + 'Ivan'
Привет, Ivan
u'Привет, ' + 'Иван'
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
name = '\xd0\x98\xd0\xb2\xd0\xb0\xd0\xbd'
u'Привет, ' + name.decode('utf-8')
>>> Привет, Иван
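One way to avoid mixing is to decode 8-bit input once, at the boundary of the program. The sketch below assumes Python 3, where mixing raises `TypeError` instead of attempting an implicit ASCII decode. `to_text` is a hypothetical helper and UTF-8 is an assumed default.

```python
def to_text(value, encoding='utf-8'):
    """Decode 8-bit input once, at the boundary (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


hello = '\u041f\u0440\u0438\u0432\u0435\u0442, '   # "Привет, "
name = b'\xd0\x98\xd0\xb2\xd0\xb0\xd0\xbd'         # UTF-8 bytes, as above

greeting = hello + to_text(name)

try:
    hello + name
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```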
The methods are the same
data = u'Юникод и Ко'
data.upper() # ЮНИКОД И КО
data.lower() # юникод и ко
data.split(u' ')
>>> [u'\u042e\u043d\u0438\u043a\u043e\u0434', u'\u0438', u'\u041a\u043e']
data.replace(u'Ко', u'Компания') # Юникод и Компания
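The methods have the same names on 8-bit strings, but only the Unicode versions know about non-Latin letters. This is a sketch assuming Python 3, where `bytes.upper()` changes ASCII letters only.

```python
text = '\u044e\u043d\u0438\u043a\u043e\u0434'   # "юникод"
raw = text.encode('utf-8')

upper_text = text.upper()   # every letter is upper-cased
upper_raw = raw.upper()     # no ASCII letters inside, so nothing changes

print(ascii(upper_text))
print(upper_raw == raw)
print(len(text), len(raw))  # symbols versus bytes
```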