Unicode in Python

Before Unicode

  • Memory consists of bytes
  • A string is a sequence of bytes
  • One byte can take one of 256 values
  • So one byte can stand for one of 256 symbols
  • To print a byte, you look up the symbol it stands for...
  • ... in a special table called a code page

ASCII code page

  • A special table that maps bytes to symbols
  • The ASCII table is the standard
  • 0..127 for Latin and system symbols
  • 128..255 for extended symbols
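
The ASCII range above can be checked directly. This is a minimal sketch in Python 3 syntax (an assumption, since the slides further down target Python 2.x): bytes 0..127 decode under the `ascii` codec, while a byte above 127 is rejected.

```python
# Python 3 sketch: the ASCII codec covers only bytes 0..127.
text = bytes([72, 101, 108, 108, 111]).decode('ascii')  # five bytes, all below 128
print(text)

try:
    bytes([200]).decode('ascii')  # 200 is outside the ASCII range
    high_byte_ok = True
except UnicodeDecodeError:
    high_byte_ok = False
print(high_byte_ok)
```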

How about foreign languages?

  • 0..127 for Latin and system symbols anyway
  • 128..255 for region-specific symbols
  • One region, one code page
  • cp866 DOS-Cyrillic
  • cp1251 Windows-Cyrillic
  • cp1253 Windows-Greek
  • etc.
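
To see why one code page per region causes trouble, the sketch below decodes the same single byte with the three code pages listed above. It uses Python 3 syntax (an assumption; the later slides use Python 2.x), and the byte value 0xCF is chosen only for illustration.

```python
# Python 3 sketch: one byte, three code pages, three different symbols.
raw = bytes([0xCF])
as_cp1251 = raw.decode('cp1251')  # Windows-Cyrillic
as_cp866 = raw.decode('cp866')    # DOS-Cyrillic
as_cp1253 = raw.decode('cp1253')  # Windows-Greek
print(as_cp1251, as_cp866, as_cp1253)
```

The byte is identical in all three cases; only the lookup table changes.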

Encoding hell

  • Lots of code pages
  • Every message should declare the code page
  • Encode/decode errors, strange symbols instead of text
  • Remember your inbox 10 years ago

How it fails

  • User A types "Привет" on Linux (utf-8)
  • The bytes are \xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82
  • The program should send a header with the encoding, but it doesn't
  • User B receives the message without the header on Windows (cp1251)
  • So the program falls back to the default code page, cp1251
  • The decoded message is РџСЂРёРІРµС‚
  • :--(
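
The scenario above can be reproduced in a few lines. This is a sketch in Python 3 syntax (an assumption; the slides below use Python 2.x), and the variable names are made up for the example.

```python
# Python 3 sketch of the failure: UTF-8 bytes read with the wrong code page.
original = 'Привет'
wire_bytes = original.encode('utf-8')    # what user A's program sends
garbled = wire_bytes.decode('cp1251')    # what user B's program shows
print(garbled)

# The bytes themselves were never damaged, so the mistake is reversible here.
repaired = garbled.encode('cp1251').decode('utf-8')
print(repaired)
```

The round trip works in this case only because every byte happens to be defined in cp1251; in general a wrong decode can lose information.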

The solution is Unicode

  • It's a single character set shared by all writing systems
  • Every symbol has its own number (a code point)
  • Unicode is growing
  • U+XXXX refers to a specific code point
U+0031      1
U+00A9      ©
U+20AC      €
U+266B      ♫
U+4E64      乤
U+45B4      䖴
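
The table above can be explored from code. A minimal sketch in Python 3 syntax (an assumption; in Python 2.x the counterpart of `chr` for Unicode is `unichr`):

```python
# Python 3 sketch: converting between characters and code point numbers.
euro = '€'
code_point = ord(euro)           # character -> number
label = 'U+%04X' % code_point    # conventional U+XXXX notation
note = chr(0x266B)               # number -> character
print(label, note)
```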

What is UTF?

  • An algorithm for turning Unicode code points into bytes and reading them back
  • There are UTF-8, UTF-16 and UTF-32 encodings
  • UTF-8 uses 1 byte for ASCII characters and 2 to 4 bytes for everything else
  • UTF-8 is backward compatible with ASCII, so plain English text is unchanged
  • Each of them can encode every Unicode character
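
The byte counts are easy to measure. The sketch below, in Python 3 syntax (an assumption; the slides below use Python 2.x), encodes four sample characters chosen for illustration and records how many bytes each UTF encoding needs.

```python
# Python 3 sketch: bytes per character in the three UTF encodings.
samples = ['A', 'П', '€', '\U0001F600']  # U+0041, U+041F, U+20AC, U+1F600
sizes = {}
for ch in samples:
    sizes[ch] = (
        len(ch.encode('utf-8')),
        len(ch.encode('utf-16-le')),  # -le: no byte order mark is prepended
        len(ch.encode('utf-32-le')),
    )
    print('U+%04X' % ord(ch), sizes[ch])
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4.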

Still have problems with Unicode?

  • Try to use UTF-8 encoding everywhere
  • Read Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
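
One practical reading of "UTF-8 everywhere" is to name the encoding explicitly whenever text crosses a file boundary. A minimal sketch in Python 3 syntax (an assumption; in Python 2.x `io.open` accepts the same `encoding` argument). The file name and sample text are made up for the example.

```python
# Python 3 sketch: name the encoding explicitly on every read and write.
import os
import tempfile

text = 'Привет, João, Ιαννης'
path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')  # throwaway location

with open(path, 'w', encoding='utf-8') as f:   # do not rely on the locale default
    f.write(text)

with open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

with open(path, 'rb') as f:                    # the same file, as raw bytes
    raw = f.read()
```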

Unicode in Python 2.x

  • Unicode strings are declared with the u'...' prefix
  • There should be a coding directive at the top of your file
  • Use \uXXXX to refer to Unicode symbols U+0000..U+FFFF
  • Use \UXXXXXXXX to refer to symbols up to U+10FFFF (rare)
# coding=utf-8           # -*- coding: utf-8 -*-
name = u'Иван'
name = u'Ιαννης'
name = u'João'
greet = u'\u041f\u0440\u0438\u0432\u0435\u0442'
print greet
>>> Привет
print u'\U0000041f\U00000440\U00000438\U00000432\U00000435\U00000442'
>>> Привет

data = u"""
Any string in english
Любая строка на русском
任何字符串在中國
"""

print data

Any string in english
Любая строка на русском
任何字符串在中國

print repr(data)
u'\nAny string in english\n\u041b\u044e\u0431\u0430\u044f \u0441\u0442\u0440\u043e\u043a\u0430 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c\n\u4efb\u4f55\u5b57\u7b26\u4e32\u5728\u4e2d\u570b\n'
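
For comparison, here is a sketch of the same literals in Python 3 syntax (an assumption; the slide itself is about Python 2.x). There every `str` literal is already Unicode, and the `u` prefix is optional.

```python
# Python 3 sketch: every str literal is already Unicode.
greet = '\u041f\u0440\u0438\u0432\u0435\u0442'   # 4-digit escapes
same = '\U0000041f\U00000440\U00000438\U00000432\U00000435\U00000442'  # 8-digit escapes
legacy = u'Привет'   # the u prefix is still accepted (Python 3.3+)
print(greet == same == legacy)
```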

How to turn an object into Unicode

unicode([1, 2, True, {"test": 42}])
>>> u"[1, 2, True, {'test': 42}]"

unicode(obj) == obj.__unicode__()

class User(object):
    def __unicode__(self):
        return u'%s %s' % (self.name, self.surname)

user = User(...)
unicode(user)  # u'Михаил Паниковский'
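
A runnable variant of the class above, sketched in Python 3 syntax (an assumption), where `str()` and `__str__` take the place of `unicode()` and `__unicode__`. The constructor and its arguments are hypothetical, since the slide does not show them.

```python
# Python 3 sketch: __str__ plays the role of Python 2's __unicode__.
class User(object):
    def __init__(self, name, surname):   # hypothetical constructor
        self.name = name
        self.surname = surname

    def __str__(self):
        return '%s %s' % (self.name, self.surname)

user = User('Михаил', 'Паниковский')
full_name = str(user)                        # calls User.__str__
as_text = str([1, 2, True, {'test': 42}])    # built-in objects work the same way
```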

How to turn an 8-bit string into Unicode

message = '\xcf\xf0\xe8\xe2\xe5\xf2'
decoded = message.decode('cp1251')
print repr(decoded)
>>> u'\u041f\u0440\u0438\u0432\u0435\u0442'
print decoded
>>> Привет

message = '\x8f\xe0\xa8\xa2\xa5\xe2'
decoded = unicode(message, 'cp866')
print decoded
>>> Привет
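
The same decoding step, sketched in Python 3 syntax (an assumption), where 8-bit data is a separate `bytes` type and `str(data, encoding)` replaces `unicode(data, encoding)`:

```python
# Python 3 sketch: 8-bit data is a bytes object; decoding yields str.
cp1251_message = b'\xcf\xf0\xe8\xe2\xe5\xf2'
cp866_message = b'\x8f\xe0\xa8\xa2\xa5\xe2'

from_cp1251 = cp1251_message.decode('cp1251')
from_cp866 = str(cp866_message, 'cp866')   # same effect as .decode('cp866')
print(from_cp1251 == from_cp866)
```

Two different byte strings produce the same text, because each is read with the code page it was written in.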

How to turn Unicode into an 8-bit string

udata = u'Сообщение'
udata.encode('cp1251')
>>> '\xd1\xee\xee\xe1\xf9\xe5\xed\xe8\xe5'
udata.encode('utf-8')
>>> '\xd0\xa1\xd0\xbe\xd0\xbe\xd0\xb1\xd1\x89\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5'
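
A sketch of the encoding step in Python 3 syntax (an assumption), where `str.encode` returns a `bytes` object. Comparing lengths shows how the choice of encoding changes the size of the result.

```python
# Python 3 sketch: one string, two encodings, two different byte sequences.
udata = 'Сообщение'
as_cp1251 = udata.encode('cp1251')   # one byte per Cyrillic letter
as_utf8 = udata.encode('utf-8')      # two bytes per Cyrillic letter
print(len(udata), len(as_cp1251), len(as_utf8))
```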

Encode/decode chaining

>>> u'\u0418\u0432\u0430\u043d'

>>> '\xd0\x98\xd0\xb2\xd0\xb0\xd0\xbd'
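
The exact expressions from this slide are not preserved. The sketch below, in Python 3 syntax (an assumption), shows one chain that produces the two values above: cp1251 bytes are decoded to text and then re-encoded as UTF-8. The starting bytes are chosen for illustration.

```python
# Python 3 sketch: transcoding by chaining decode and encode.
legacy_bytes = b'\xc8\xe2\xe0\xed'        # "Иван" in cp1251
text = legacy_bytes.decode('cp1251')      # bytes -> str
utf8_bytes = text.encode('utf-8')         # str -> bytes

# The same conversion written as a single chain.
one_liner = legacy_bytes.decode('cp1251').encode('utf-8')
```

Text always sits in the middle of the chain: there is no direct bytes-to-bytes conversion between two code pages.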

Encode/Decode errors

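>>> u'Ivan'.encode('cp1251')
'Ivan'

>>> u'Иван'.encode('cp1251')
'\xc8\xe2\xe0\xed'

>>> u'任何字符串在中國'.decode('cp1251')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)
# decode() on a unicode object first encodes it to ASCII implicitly, hence an *encode* error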
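
When a strict conversion is not required, the `errors` argument of `encode` and `decode` selects a fallback. A sketch in Python 3 syntax (an assumption; the same handlers exist in Python 2.x). Note that in Python 3 `str` has no `decode` method at all, so the mistake from the last example cannot be written.

```python
# Python 3 sketch: strict encoding fails, error handlers trade accuracy for robustness.
name = 'Иван'

try:
    name.encode('ascii')          # default errors='strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = name.encode('ascii', errors='replace')            # '?' per character
ignored = name.encode('ascii', errors='ignore')              # characters dropped
xml_refs = name.encode('ascii', errors='xmlcharrefreplace')  # numeric references
lossy = b'\xd0\x98'.decode('ascii', errors='replace')        # U+FFFD per bad byte
str_has_decode = hasattr(name, 'decode')
```

Only `xmlcharrefreplace` keeps enough information to recover the original text; `replace` and `ignore` are lossy.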

The meaning of strings

  • A Unicode string is text meant for humans
  • An 8-bit string is just bytes
  • Avoid printing bytes
  • The result depends on the locale, code page, etc.

Don't mix unicode and 8-bit strings!

u'Привет, ' + 'Ivan'
>>> Привет, Ivan

u'Привет, ' + 'Иван'
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

name = '\xd0\x98\xd0\xb2\xd0\xb0\xd0\xbd'
u'Привет, ' + name.decode('utf-8')
>>> Привет, Иван
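
Python 3 enforces this rule: mixing the two types raises an error regardless of content, even for plain ASCII. A sketch in Python 3 syntax (an assumption) of the usual pattern: decode bytes as soon as they arrive and encode only when writing out.

```python
# Python 3 sketch: str and bytes never mix implicitly.
greeting = 'Привет, '
raw_name = b'\xd0\x98\xd0\xb2\xd0\xb0\xd0\xbd'   # UTF-8 bytes from outside

try:
    greeting + raw_name           # str + bytes is always rejected
    mixed_ok = True
except TypeError:
    mixed_ok = False

message = greeting + raw_name.decode('utf-8')   # decode at the boundary
output = message.encode('utf-8')                # encode only when writing out
```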

The methods are the same

data = u'Юникод и Ко'
data.upper()                # ЮНИКОД И КО
data.lower()                # юникод и ко
data.split(u' ')
>>> [u'\u042e\u043d\u0438\u043a\u043e\u0434', u'\u0438', u'\u041a\u043e']
data.replace(u'Ко', u'Компания')       # Юникод и Компания
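
One difference worth checking is that these methods work on characters, not bytes. A sketch in Python 3 syntax (an assumption), using the same sample string:

```python
# Python 3 sketch: string methods operate on characters, not on bytes.
data = 'Юникод и Ко'
upper = data.upper()
parts = data.split(' ')
char_count = len(data)                    # counts characters
byte_count = len(data.encode('utf-8'))    # counts bytes after encoding
print(char_count, byte_count)
```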