Online Tutorials: PHP : Internationalization and Localization - [16.12] Reading or Writing Unicode Characters

Tuesday, November 27, 2012

16.12.1 Problem

You want to read Unicode-encoded characters from a file, database, or form; or, you want to write Unicode-encoded characters.

16.12.2 Solution

Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to UTF-8:

print utf8_encode("Kurt G\xf6del is swell."); // \xf6 is the ISO-8859-1 byte for ö

Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded characters:

print utf8_decode("Kurt G\xc3\xb6del is swell.");
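The two conversions above can be checked as a round trip. The sketch below is one way to do that, not the recipe's own code: it assumes the mbstring extension is loaded and uses mb_convert_encoding( ) in place of utf8_encode( ) and utf8_decode( ), which are deprecated as of PHP 8.2. The variable names are illustrative.

```php
<?php
// "\xf6" is the single byte for ö in ISO-8859-1.
$latin1 = "Kurt G\xf6del is swell.";

// ISO-8859-1 to UTF-8 (the conversion utf8_encode() performs).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// UTF-8 back to ISO-8859-1 (the conversion utf8_decode() performs).
$roundTrip = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

print strlen($latin1) . "\n";            // 20 (one byte per character)
print strlen($utf8) . "\n";              // 21 (ö now takes two bytes)
print bin2hex(substr($utf8, 6, 2)) . "\n"; // c3b6
var_dump($roundTrip === $latin1);        // bool(true)
```

Note that strlen( ) counts bytes, not characters, which is why the UTF-8 string is one byte longer.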

16.12.3 Discussion

A single byte can hold 256 possible values. ASCII standardizes only the codes between 0 and 127: control characters, letters and numbers, and punctuation. There are different rules, however, for the characters that codes 128-255 map to. One encoding is called ISO-8859-1, which includes characters necessary for writing most Western European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more characters. This is where Unicode saves the day; it defines more than a million code points, and its UTF-8 encoding can represent all of them.
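To illustrate how codes 128-255 depend on the encoding in use, the following sketch interprets the same byte under two single-byte encodings. It assumes the mbstring extension is available. ISO-8859-15 is not mentioned in the recipe and is brought in here only for contrast.

```php
<?php
// One byte above 127; its meaning depends on the encoding you assume.
$byte = "\xa4";

// Read as ISO-8859-1, 0xA4 is the generic currency sign (U+00A4).
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Read as ISO-8859-15, the same byte is the euro sign (U+20AC).
$fromLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

print bin2hex($fromLatin1) . "\n"; // c2a4
print bin2hex($fromLatin9) . "\n"; // e282ac
```

Once converted to UTF-8, the two characters have different byte sequences, so the ambiguity disappears.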

This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.

Table 16-2. UTF-8 byte representation
Character code range	Bytes used	Byte 1	Byte 2	Byte 3	Byte 4
`0x00000000 - 0x0000007F`	1	`0xxxxxxx`
`0x00000080 - 0x000007FF`	2	`110xxxxx`	`10xxxxxx`
`0x00000800 - 0x0000FFFF`	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
`0x00010000 - 0x0010FFFF`	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

In Table 16-2, the x positions represent bits used for actual character data. The least significant bit is the rightmost bit in the rightmost byte. In multibyte characters, the number of leading 1 bits in the leftmost byte is the same as the number of bytes in the character.
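The bit layout in Table 16-2 can be applied by hand. The function below is a hypothetical helper written for illustration (codepoint_to_utf8 is not a PHP built-in); it packs a code point's bits into the x positions of the matching row. It is a minimal sketch that assumes PHP 7 or later and a valid code point, and it does not reject surrogates or out-of-range values. In practice, mb_chr( ) from the mbstring extension (PHP 7.2+) does this job.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 by hand,
// following the rows of Table 16-2. No validation is performed.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

foreach ([0x41, 0xF6, 0x20AC, 0x1F600] as $cp) {
    printf("U+%04X -> %s\n", $cp, bin2hex(codepoint_to_utf8($cp)));
}
// U+0041 -> 41
// U+00F6 -> c3b6
// U+20AC -> e282ac
// U+1F600 -> f09f9880
```

Each continuation byte starts with the bits 10 and carries six bits of character data, which is why the code shifts by multiples of 6 and masks with 0x3F.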
