16.12.1 Problem
You want to
read Unicode-encoded characters from a file, database, or form; or, you want to
write Unicode-encoded characters.
16.12.2 Solution
Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to
UTF-8:
print utf8_encode('Kurt Gödel is swell.');
Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte
ISO-8859-1 encoded characters:
print utf8_decode("Kurt G\xc3\xb6del is swell.");
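As of PHP 8.2, utf8_encode( ) and utf8_decode( ) are deprecated. The following is a minimal sketch of the same two conversions using mb_convert_encoding( ), assuming the mbstring extension is loaded. The variable names are illustrative only, and the ö is written as an explicit byte escape so the result does not depend on how the script file itself is saved.

```php
<?php
// Assumption: the mbstring extension is available.
// ö written as the single ISO-8859-1 byte 0xF6.
$latin1 = "Kurt G\xf6del is swell.";

// ISO-8859-1 -> UTF-8 (the conversion utf8_encode() performs)
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// UTF-8 -> ISO-8859-1 (the conversion utf8_decode() performs)
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Show the raw bytes: 0xF6 becomes the pair 0xC3 0xB6 in UTF-8.
print bin2hex("G\xf6") . " -> " . bin2hex(mb_convert_encoding("G\xf6", 'UTF-8', 'ISO-8859-1')) . "\n";
print ($back === $latin1 ? "round trip ok" : "round trip failed") . "\n";
```

Naming both the source and target encodings explicitly also makes it harder to encode a string twice by accident, which is a common cause of garbled output such as "Ã¶" in place of "ö".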
16.12.3 Discussion
A single byte can hold 256 possible values. ASCII standardizes the characters between codes 0 and 127:
control characters, letters and numbers, and punctuation. There
are different rules, however, for the characters that codes 128-255 map to. One
encoding is called ISO-8859-1, which includes characters necessary for writing
most European languages, such as the ö in Gödel or the ñ
in pestaña. Many languages, though, require more than 256 characters, and a
character set that can express more than one language requires even more
characters. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million
characters.
This increased functionality comes at the cost of space. ISO-8859-1
characters are always stored in one byte; in UTF-8, only the ASCII characters
(codes 0 through 127) fit in one byte, and all other characters need two to four
bytes. Table 16-2 shows the
byte representations of UTF-8 encoded characters.
| Character code range | Bytes used | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| 0x00000000 - 0x0000007F | 1 | 0xxxxxxx | | | |
| 0x00000080 - 0x000007FF | 2 | 110xxxxx | 10xxxxxx | | |
| 0x00000800 - 0x0000FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 0x00010000 - 0x001FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
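The bit layouts in the table can be sketched in code. The function below, codepoint_to_utf8( ), is a hypothetical helper written for illustration, not part of PHP. It places the bits of a code point into the x positions shown for each range. One deliberate difference from the table: the sketch rejects code points above 0x10FFFF, because the current definition of UTF-8 (RFC 3629) stops there even though the four-byte layout has room for values up to 0x1FFFFF.

```php
<?php
// Hypothetical helper for illustration: encode one Unicode code point
// as UTF-8 by filling in the "x" bits from the table.
function codepoint_to_utf8(int $cp): string
{
    // Assumption: follow RFC 3629 and stop at 0x10FFFF.
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException("Code point out of range: $cp");
    }
    if ($cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

print bin2hex(codepoint_to_utf8(0x41)) . "\n";    // 41       (A, 1 byte)
print bin2hex(codepoint_to_utf8(0xF6)) . "\n";    // c3b6     (ö, 2 bytes)
print bin2hex(codepoint_to_utf8(0x20AC)) . "\n";  // e282ac   (euro sign, 3 bytes)
print bin2hex(codepoint_to_utf8(0x1F600)) . "\n"; // f09f9880 (4 bytes)
```

The two-byte result for 0xF6 matches the \xc3\xb6 sequence used for the ö in the Solution. The sketch does not reject the surrogate range 0xD800 - 0xDFFF, which a strict encoder would also refuse.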