SysconsUnicodeProject/UnicodeBasics

What is Unicode: Very Brief and Informal

Unicode is a specification that describes text representation. In particular, it specifies a list of many standardized characters. Each character is assigned a number. The appearance of a character is not specified, only its name, number and some properties (whether a letter is capital or not, etc).

For example: "042A - Cyrillic Capital Letter Hard Sign"

The range of assignable numbers is huge (it runs up to 10FFFF hexadecimal), which is why Unicode requires at least 21 bits of storage per character.
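As a small illustration of the points above, the following sketch looks up the name, number and a couple of properties of the example character, and computes how many bits the largest number needs. It assumes Python 3 and its standard unicodedata module; the variable names are made up for this example.

```python
import unicodedata

ch = "\u042A"  # the character numbered 042A (hexadecimal)

name = unicodedata.name(ch)          # standardized name
number = ord(ch)                     # assigned number
is_capital = ch.isupper()            # one of the character's properties
category = unicodedata.category(ch)  # general category, e.g. "Lu" = uppercase letter

# The largest number Unicode can assign is 10FFFF (hexadecimal).
max_number = 0x10FFFF
bits_needed = max_number.bit_length()

print(f"{number:04X} - {name} (capital: {is_capital}, category: {category})")
print("bits needed per character:", bits_needed)
```

Note that the output says nothing about how the character looks; that is left to the font.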

Unicode also specifies several representations of text. The simplest representation, a sequence of numbers (most often 32-bit ones), is highly space-consuming: for English text it takes 4 times as much space as ASCII. That's why there are 3 standard forms of text representation: UTF-32, UTF-16 and UTF-8.

UTF-32 is just a plain sequence of character numbers. It requires 32 bits per storage unit, hence the name.
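A minimal sketch of this, assuming Python 3: the UTF-32 bytes of a string are unpacked by hand as 32-bit integers and compared with the character numbers. The little-endian variant ("utf-32-le") is chosen here only to avoid a byte order mark; that choice is an assumption of the example, not something the text above prescribes.

```python
import struct

text = "Hello"

ascii_bytes = text.encode("ascii")
utf32_bytes = text.encode("utf-32-le")  # little-endian, no byte order mark

# Decode by hand: every 4 bytes form one unsigned 32-bit integer.
count = len(utf32_bytes) // 4
numbers = list(struct.unpack("<%dI" % count, utf32_bytes))

print(numbers)                              # the character numbers themselves
print(len(ascii_bytes), len(utf32_bytes))   # size in bytes of each form
```

For plain English text the UTF-32 form is four times the size of the ASCII form, as described earlier.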

UTF-16 and UTF-8 can be thought of as a kind of data compression, like gzip or bzip2. UTF-16 uses 16-bit storage units and UTF-8 uses 8-bit storage units. The size of the storage unit matters only inside the encoding and decoding algorithms. All three forms can represent every Unicode character, so they are equally good for the purpose of preserving the whole text.
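The following sketch, assuming Python 3 and its built-in codecs, encodes one sample string in all three forms, checks that each form decodes back to the same text, and records the size of each. The sample string is made up for this example; it mixes ASCII, a Cyrillic letter and a character above FFFF.

```python
text = "\u042A \U0001F600 abc"  # 7 characters in total

sizes = {}
round_trips = {}
for form in ("utf-32-le", "utf-16-le", "utf-8"):
    data = text.encode(form)
    sizes[form] = len(data)                      # size in bytes
    round_trips[form] = data.decode(form) == text

for form in sizes:
    print(form, sizes[form], "bytes, lossless:", round_trips[form])
```

The sizes differ, but every form restores exactly the same text.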

UTF-8 and UTF-16 differ from common data compression algorithms like the zip flavors. They are specifically designed for text. The main difference is that generally compressed data are stored in compressed form and decompressed when used. By contrast, applications usually keep text in memory as UTF-8 or UTF-16. It's possible to perform most text processing directly on UTF-8 and UTF-16 text without having to decode it into any temporary storage.
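To illustrate processing text without decoding it, here is a minimal sketch in Python 3 that counts characters and searches for a substring directly on UTF-8 bytes. The helper names are hypothetical, and the code assumes its input is valid UTF-8.

```python
def count_characters(utf8: bytes) -> int:
    """Count characters by skipping continuation bytes (bit pattern 10xxxxxx)."""
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)

def find_bytes(utf8: bytes, needle: bytes) -> int:
    """Plain byte search; returns a byte offset, or -1 if not found."""
    return utf8.find(needle)

data = "Привет, мир".encode("utf-8")
needle = "мир".encode("utf-8")

n_chars = count_characters(data)
offset = find_bytes(data, needle)

print(len(data), "bytes,", n_chars, "characters, match at byte", offset)
```

The byte search is safe because a complete UTF-8 character never matches in the middle of another character's byte sequence.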

UTF-8 has some characteristics that make it well suited to UNIX-like systems:

A zero-valued byte never occurs in the middle of a string.
UTF-8 overlaps with ASCII. Any valid ASCII text is valid UTF-8 text.
Any parser that tokenizes strings on ASCII characters and passes non-ASCII byte sequences (values 128-255) through unchanged can be used with UTF-8 without modification.
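The three characteristics above can be sketched as follows, assuming Python 3. The record format and the parse_fields helper are made up for this example; the parser knows only the ASCII bytes ';' and '='.

```python
# Sample "key=value;key=value" record with non-ASCII values.
text = "user=Жук;home=/home/жук;shell=/bin/sh"
data = text.encode("utf-8")

# 1. No zero byte appears unless the text itself contains character number 0.
has_zero_byte = 0 in data

# 2. ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# 3. Every byte of a non-ASCII character is in the range 128-255,
#    so splitting on ASCII delimiters cannot cut a character in half.
all_high = all(b >= 0x80 for b in "Жук".encode("utf-8"))

def parse_fields(record: bytes) -> dict:
    """Byte-level tokenizer; non-ASCII bytes pass through untouched."""
    fields = {}
    for item in record.split(b";"):
        key, _, value = item.partition(b"=")
        fields[key] = value
    return fields

parsed = {k.decode("utf-8"): v.decode("utf-8")
          for k, v in parse_fields(data).items()}

print(has_zero_byte, same_bytes, all_high)
print(parsed)
```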

There are a few things that make Unicode harder to work with:

The length of a string in characters is not equal to its length in bytes.
When displayed in a monospaced font, some characters occupy two columns. See wcwidth(3).
Certain characters can be written in two forms, namely composed and decomposed. An example is è, which can be written as a single character (00E8) or as an e followed by a combining grave accent (0300).
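These three difficulties can be demonstrated with a short sketch, assuming Python 3 and its standard unicodedata module. The columns helper is a hypothetical, rough stand-in for wcwidth(3): it treats East Asian wide and fullwidth characters as two columns and combining marks as zero, which is an approximation and not the real wcwidth(3) behaviour.

```python
import unicodedata

composed = "\u00E8"      # è as a single character
decomposed = "e\u0300"   # e followed by a combining grave accent

# Length in characters differs from length in bytes.
chars = len(composed)
utf8_bytes = len(composed.encode("utf-8"))

# The two forms look identical but compare unequal until normalized.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

def columns(s: str) -> int:
    """Rough column count for monospaced output (approximation only)."""
    total = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks take no column of their own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total

print(chars, utf8_bytes, equal_raw, equal_nfc, equal_nfd)
print(columns("abc"), columns("\u6F22\u5B57"), columns(decomposed))
```

Code that compares or measures strings usually has to normalize them to one form first and must not assume one byte or one column per character.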

This description of Unicode is neither complete nor strict. For more details, read:

http://unicode.org/

http://en.wikipedia.org/wiki/Unicode

SysconsUnicodeProject/UnicodeBasics (last edited 2009-01-21T18:53:55+0000 by EdSchouten)