Introduction to Python
Theory: Unicode in Python
Unicode in Python
Before Unicode
- Memory consists of bytes
- A string is a sequence of bytes
- One byte can take one of 256 values
- So one byte can stand for one of 256 symbols
- To print a byte, you look up the symbol it stands for in a special table called a code page
ASCII code page
- A special table that maps bytes to symbols
- The ASCII table is the standard
- 0..127 for Latin and system (control) symbols
- 128..255 left for extended symbols (not part of ASCII itself)
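As a rough illustration of the 0..127 range, the sketch below relies only on the built-in ord function and the standard 'ascii' codec. The variable names are made up for this example.

```python
# ASCII assigns the numbers 0..127: 'A' is 65, the line-feed control code is 10.
letter_code = ord(u'A')
newline_code = ord(u'\n')

# A byte in 0..127 decodes as ASCII without trouble.
decoded = b'\x41'.decode('ascii')

# A byte in 128..255 is outside ASCII, so strict decoding fails.
try:
    b'\x80'.decode('ascii')
    high_byte_is_ascii = True
except UnicodeDecodeError:
    high_byte_is_ascii = False
```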
How about foreign languages?
- 0..127 for Latin and system symbols anyway
- 128..255 for region-specific symbols
- One region, one code page
- cp866: DOS Cyrillic
- cp1251: Windows Cyrillic
- cp1253: Windows Greek
- etc.
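To make "one region, one code page" concrete, here is a small sketch that decodes the same byte with the three code pages named above. The byte value 0xE1 is picked only for illustration.

```python
# One byte, three code pages, three different symbols.
raw = b'\xe1'

as_cp1251 = raw.decode('cp1251')  # Windows Cyrillic
as_cp866 = raw.decode('cp866')    # DOS Cyrillic
as_cp1253 = raw.decode('cp1253')  # Windows Greek
```

With Python's standard codecs the three results are б, с and α: the byte is the same, only the table changed.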
Encoding hell
- Lots of code pages
- Every message should declare the code page
- Encode/decode errors, strange symbols instead of text
- Remember your inbox 10 years ago
How it fails
- User A types "Привет" on Linux (utf-8)
- The bytes are \xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82
- The program should send a header with the encoding, but it doesn't
- User B receives the message without a header on Windows (cp1251)
- So the default code page becomes cp1251
- The decoded message is РџСЂРёРІРµС‚
- :--(
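The failure can be reproduced in a few lines. This is only a sketch of the scenario: the names message, sent and received are invented here, and the "network" is just a variable.

```python
# User A: the text "Привет", written with escapes so the source stays ASCII.
message = u'\u041f\u0440\u0438\u0432\u0435\u0442'

# Linux side: the text is encoded as UTF-8 before sending.
sent = message.encode('utf-8')

# Windows side: no encoding header arrived, so cp1251 is assumed.
received = sent.decode('cp1251')

# Decoding with the encoding that was actually used restores the text.
repaired = sent.decode('utf-8')
```

Note that received has 12 characters instead of 6: every UTF-8 byte was read as a separate cp1251 symbol.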
The solution is Unicode
- It's a single alphabet shared by all languages
- Every symbol has its own number (a code point)
- Unicode is growing
- U+XXXX refers to a specific code point
U+0031 1
U+00A9 ©
U+20AC €
U+266B ♫
U+4E64 乤
U+45B4 䖴
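The U+XXXX notation is just the symbol's number written in hexadecimal. A minimal sketch, using the built-in ord and the standard unicodedata module (the list names are illustrative):

```python
import unicodedata

symbols = [u'1', u'\xa9', u'\u20ac', u'\u266b']

# Format each symbol's number in the U+XXXX notation.
labels = ['U+%04X' % ord(symbol) for symbol in symbols]

# The Unicode database also stores an official name for each code point.
names = [unicodedata.name(symbol) for symbol in symbols]
```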
What is UTF?
- An algorithm for turning Unicode code points into bytes and reading them back
- There are UTF-8, UTF-16 and UTF-32 encodings
- UTF-8 uses 1 byte for ASCII (basic Latin) symbols and 2 to 4 bytes for everything else
- UTF-8 is compatible with plain English (ASCII) text
- Can encode the whole of Unicode
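The byte counts are easy to check. The sketch below measures four sample symbols (a Latin letter, a Cyrillic letter, the euro sign and an emoji, all chosen for illustration) in each encoding. The "-le" codec variants are used so that no byte order mark is added to the output.

```python
samples = [u'A', u'\u041f', u'\u20ac', u'\U0001f600']

# Number of bytes each symbol takes in each encoding.
utf8_sizes = [len(s.encode('utf-8')) for s in samples]
utf16_sizes = [len(s.encode('utf-16-le')) for s in samples]
utf32_sizes = [len(s.encode('utf-32-le')) for s in samples]

# For pure ASCII text, UTF-8 bytes are identical to ASCII bytes.
ascii_compatible = u'hello'.encode('utf-8') == u'hello'.encode('ascii')
```

UTF-8 grows from 1 to 4 bytes per symbol, UTF-16 uses 2 or 4, and UTF-32 always uses 4.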
Still have problems with Unicode?
- Try to use the UTF-8 encoding everywhere
- Read Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Unicode in Python 2.x
- Unicode strings are declared with the u'...' prefix
- There should be a coding directive in your file
- Use \uXXXX to refer to Unicode symbols 0..FFFF
- Use \UXXXXXXXX to refer to symbols above FFFF, up to 10FFFF (rare)
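A short sketch of these literal forms. It assumes the file is saved as UTF-8, and it also runs on Python 3.3+, where the u prefix is accepted for compatibility. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# The coding directive above tells Python 2 how the source file is encoded.

greeting = u'Привет'                                      # typed directly
same_greeting = u'\u041f\u0440\u0438\u0432\u0435\u0442'   # same text via \uXXXX
clef = u'\U0001d11e'                                      # above FFFF: \UXXXXXXXX

literals_match = greeting == same_greeting
```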
How to turn an object into Unicode
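A minimal sketch of such a conversion, assuming the usual approach of calling the Unicode string type on the object. In Python 2 that type is unicode, in Python 3 it is str. The text_type alias is a compatibility shim introduced for this example only.

```python
try:
    text_type = unicode  # Python 2: the Unicode string type
except NameError:
    text_type = str      # Python 3: str is already Unicode

number_text = text_type(42)
list_text = text_type([1, 2, 3])
```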
How to turn an 8-bit string into Unicode
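One way to do this is the decode method, which needs the name of the encoding the bytes were written in. A sketch, assuming the bytes are UTF-8 (the names raw and text are illustrative):

```python
# Twelve bytes: the UTF-8 form of a six-letter Russian word.
raw = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

# decode() turns bytes into Unicode text; the encoding must be named.
# In Python 2, unicode(raw, 'utf-8') does the same thing.
text = raw.decode('utf-8')
```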
How to turn a Unicode string into an 8-bit string
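The reverse direction is the encode method. The sketch below encodes the same text with two encodings to show that the resulting bytes depend on the choice (the variable names are illustrative):

```python
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'

# encode() turns Unicode text into bytes in the named encoding.
as_utf8 = text.encode('utf-8')      # two bytes per Cyrillic letter
as_cp1251 = text.encode('cp1251')   # one byte per Cyrillic letter
```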
Encode/decode chaining
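One common use of chaining is converting bytes from one encoding to another: decode to Unicode first, then encode again. A sketch, assuming the input is cp1251 and the target is UTF-8 (the names are illustrative):

```python
# "Привет" as stored by a cp1251 system.
legacy = b'\xcf\xf0\xe8\xe2\xe5\xf2'

# bytes -> Unicode -> bytes: Unicode is the intermediate form.
utf8 = legacy.decode('cp1251').encode('utf-8')

# The chain is reversible as long as both encodings cover the text.
round_trip = utf8.decode('utf-8').encode('cp1251')
```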
Encode/Decode errors
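Errors appear when the target encoding has no symbol for a character, or when the bytes are not valid in the assumed encoding. The sketch below triggers both cases with the 'ascii' codec and shows the standard 'replace' and 'ignore' error handlers (the variable names are illustrative):

```python
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'

# ASCII has no Cyrillic letters, so strict encoding raises an error.
encode_failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    encode_failed = True

# Error handlers trade correctness for robustness.
replaced = text.encode('ascii', 'replace')  # each letter becomes '?'
ignored = text.encode('ascii', 'ignore')    # each letter is dropped

# Bytes above 127 are not valid ASCII, so strict decoding raises too.
decode_failed = False
try:
    b'\xd0\x9f'.decode('ascii')
except UnicodeDecodeError:
    decode_failed = True

# With 'replace', each bad byte becomes U+FFFD, the replacement character.
lenient = b'\xd0\x9f'.decode('ascii', 'replace')
```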
The meaning of strings
- A Unicode string is text for humans
- An 8-bit string is just bytes
- Avoid printing bytes
- The printed result depends on the locale, code page, etc.
Don't mix Unicode and 8-bit strings!
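A sketch of what goes wrong and one way to avoid it. In Python 2 the mix triggers an implicit ASCII decode, which fails on non-ASCII bytes; in Python 3 it is a TypeError. The helper to_text is hypothetical, written for this example, and assumes incoming bytes are UTF-8.

```python
# Mixing the two types fails on both major Python versions.
mixing_failed = False
try:
    mixed = u'\u041f' + b'\xd1\x80'
except (TypeError, UnicodeDecodeError):
    mixing_failed = True


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass Unicode through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Decode at the boundary, then work with Unicode only.
greeting = to_text(b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82') + to_text(u'!')
```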
The methods are the same
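A brief sketch of a few string methods that both types share, with illustrative variable names. The results keep the type of the string they were called on.

```python
text = u'hello world'
raw = b'hello world'

# The same method names work on Unicode and on 8-bit strings.
text_upper = text.upper()
raw_upper = raw.upper()

text_words = text.split()
raw_words = raw.split()

text_starts = text.startswith(u'hello')
raw_starts = raw.startswith(b'hello')
```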