UTF-8 is a byte encoding used to encode unicode characters. UTF-8 uses 1, 2, 3 or 4 bytes to represent a unicode character. Remember, a unicode character is represented by a unicode code point. Thus, UTF-8 uses 1, 2, 3 or 4 bytes to represent a unicode code point.
UTF-8 is a very commonly used textual encoding on the web. Web browsers understand UTF-8. Many programming languages also allow you to use UTF-8 in your code, and can import and export UTF-8 text easily. Several textual data formats and markup languages are often encoded in UTF-8, for instance JSON, XML, HTML, CSS and SVG.
UTF-8 Marker Bits and Code Point Bits
When translating a unicode code point to one or more UTF-8 encoded bytes, each of these bytes is composed of marker bits and code point bits. The marker bits tell how to interpret the given byte. The code point bits are used to represent the value of the code point. In the following sections the marker bits are written using 0's and 1's, and the code point bits are written using the characters Z, Y, X, W and V. Each character represents a single bit.
Unicode Code Point Intervals Used in UTF-8
For unicode code points in the hexadecimal value interval U+0000 to U+007F UTF-8 uses a single byte to represent the character. The code points in this interval represent the same characters as the ASCII characters, and use the same integer values (code points) to represent them. In binary digits, the single byte representing a code point in this interval looks like this:

0ZZZZZZZ

The marker bit has the value 0. The bits representing the code point value are marked with Z.
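For example, the uppercase character A has the code point U+0041, which is 1000001 in binary. It therefore fits into a single byte:

01000001   (hexadecimal 41, the same value as in ASCII)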
For unicode code points in the interval U+0080 to U+07FF UTF-8 uses two bytes to represent the character. In binary digits, the two bytes representing a code point in this interval look like this:

110YYYYY 10ZZZZZZ

The marker bits are the 110 and 10 bits of the two bytes. The Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.
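For example, the character é has the code point U+00E9, which is 00011101001 written as 11 binary digits. The first 5 bits go into the Y positions and the last 6 bits into the Z positions:

11000011 10101001   (hexadecimal C3 A9)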
For unicode code points in the interval U+0800 to U+FFFF UTF-8 uses three bytes to represent the character. In binary digits, the three bytes representing a code point in this interval look like this:

1110XXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 1110 and 10 bits of the three bytes. The X, Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.
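For example, the euro sign € has the code point U+20AC, which is 0010000010101100 written as 16 binary digits. Distributed over the X, Y and Z positions this becomes:

11100010 10000010 10101100   (hexadecimal E2 82 AC)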
For unicode code points in the interval U+10000 to U+10FFFF UTF-8 uses four bytes to represent the character. In binary digits, the four bytes representing a code point in this interval look like this:

11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 11110 and 10 bits of the four bytes. The bits named V and W mark the code point plane the character is from. The rest of the bits, marked with X, Y and Z, represent the rest of the code point. The first byte (most significant byte) is the byte on the left.
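For example, the emoji 😀 has the code point U+1F600, which is 000011111011000000000 written as 21 binary digits. Distributed over the V, W, X, Y and Z positions this becomes:

11110000 10011111 10011000 10000000   (hexadecimal F0 9F 98 80)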
Reading UTF-8
When reading UTF-8 encoded bytes into characters, you need to figure out if a given character (code point) is represented by 1, 2, 3 or 4 bytes. You do so by looking at the bit pattern of the first byte.
If the first byte has the bit pattern 0ZZZZZZZ (the most significant bit is a 0) then the character code point is represented only by this byte.

If the first byte has the bit pattern 110YYYYY (the 3 most significant bits are 110) then the character code point is represented by two bytes.

If the first byte has the bit pattern 1110XXXX (the 4 most significant bits are 1110) then the character code point is represented by three bytes.

If the first byte has the bit pattern 11110VVV (the 5 most significant bits are 11110) then the character code point is represented by four bytes.
Once you know how many bytes are used to represent the given character code point, read all the actual code point carrying bits (the bits marked with V, W, X, Y and Z) into a single 32 bit data type (e.g. a Java int). The bits then make up the integer value of the code point. Here is how a 32-bit data type looks after reading a 4-byte UTF-8 character into it:

00000000 000VVVWW XXXXYYYY YYZZZZZZ
Notice how all the marker bits (the most significant bits with the patterns 11110 and 10) have been removed from all of the 4 bytes, before the remaining bits (the bits marked with V, W, X, Y and Z) are copied into the 32-bit data type.
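Here is a minimal sketch in Java of the decoding steps described above. The class and method names are made up for this example, and it skips the validation a production decoder would need (overlong sequences, truncated input etc.):

public class Utf8Decoder {

    public static int decodeCodePoint(byte[] bytes, int offset) {
        int first = bytes[offset] & 0xFF;                  // treat the byte as unsigned
        if ((first & 0b1000_0000) == 0) {                  // 0ZZZZZZZ -> 1 byte
            return first;
        } else if ((first & 0b1110_0000) == 0b1100_0000) { // 110YYYYY -> 2 bytes
            return ((first & 0b0001_1111) << 6)
                 |  (bytes[offset + 1] & 0b0011_1111);
        } else if ((first & 0b1111_0000) == 0b1110_0000) { // 1110XXXX -> 3 bytes
            return ((first & 0b0000_1111) << 12)
                 | ((bytes[offset + 1] & 0b0011_1111) << 6)
                 |  (bytes[offset + 2] & 0b0011_1111);
        } else if ((first & 0b1111_1000) == 0b1111_0000) { // 11110VVV -> 4 bytes
            return ((first & 0b0000_0111) << 18)
                 | ((bytes[offset + 1] & 0b0011_1111) << 12)
                 | ((bytes[offset + 2] & 0b0011_1111) << 6)
                 |  (bytes[offset + 3] & 0b0011_1111);
        }
        throw new IllegalArgumentException("Not the first byte of a UTF-8 sequence");
    }
}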
Writing UTF-8
When writing UTF-8 text you need to translate unicode code points into UTF-8 encoded bytes. First, you must figure out how many bytes you need to represent the given code point. I have explained the code point value intervals at the top of this UTF-8 tutorial, so I will not repeat them here.
Second, you need to translate the bits representing the code point into the corresponding UTF-8 bytes. Once you know how many bytes are needed to represent the code point, you also know what bit pattern of marker bits and code point bits you need to use. Simply create the needed number of bytes with their marker bits set, copy the correct code point bits into each of the bytes, and you are done.
Here is an example of translating a code point that requires 4 bytes in UTF-8. The code point has the abstract value (as bit pattern):
00000000 000VVVWW XXXXYYYY YYZZZZZZ
The corresponding 4 UTF-8 bytes will look like this:
11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ
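Below is a matching sketch in Java of the encoding steps. Again, the class and method names are made up for this example, and invalid input (e.g. surrogate code points) is not checked:

public class Utf8Encoder {

    // Writes the UTF-8 bytes for codePoint into dest, starting at offset.
    // Returns the number of bytes written (1 to 4).
    public static int encode(int codePoint, byte[] dest, int offset) {
        if (codePoint <= 0x7F) {           // 1 byte:  0ZZZZZZZ
            dest[offset] = (byte) codePoint;
            return 1;
        } else if (codePoint <= 0x7FF) {   // 2 bytes: 110YYYYY 10ZZZZZZ
            dest[offset]     = (byte) (0b1100_0000 |  (codePoint >> 6));
            dest[offset + 1] = (byte) (0b1000_0000 |  (codePoint & 0b0011_1111));
            return 2;
        } else if (codePoint <= 0xFFFF) {  // 3 bytes: 1110XXXX 10YYYYYY 10ZZZZZZ
            dest[offset]     = (byte) (0b1110_0000 |  (codePoint >> 12));
            dest[offset + 1] = (byte) (0b1000_0000 | ((codePoint >> 6) & 0b0011_1111));
            dest[offset + 2] = (byte) (0b1000_0000 |  (codePoint & 0b0011_1111));
            return 3;
        } else {                           // 4 bytes: 11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ
            dest[offset]     = (byte) (0b1111_0000 |  (codePoint >> 18));
            dest[offset + 1] = (byte) (0b1000_0000 | ((codePoint >> 12) & 0b0011_1111));
            dest[offset + 2] = (byte) (0b1000_0000 | ((codePoint >> 6) & 0b0011_1111));
            dest[offset + 3] = (byte) (0b1000_0000 |  (codePoint & 0b0011_1111));
            return 4;
        }
    }
}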
Searching Forwards in UTF-8
Searching forwards in UTF-8 is reasonably straightforward. You decode one character at a time, and compare it to the character you are searching for. No big surprise here.
Searching Backwards in UTF-8
The UTF-8 encoding has the nice side effect that you can search backwards in UTF-8 encoded bytes. You can see from each byte if it is the beginning of a character or not by looking at the marker bits. The following marker bit patterns all imply that the byte is the beginning of a character:
Marker bits | Description
---|---
0 | Beginning of 1 byte character (also an ascii character)
110 | Beginning of 2 byte character
1110 | Beginning of 3 byte character
11110 | Beginning of 4 byte character
The following marker bit pattern implies that the byte is not the first byte of a UTF-8 character:
Marker bits | Description
---|---
10 | Second, third or fourth byte of a UTF-8 character
Notice how you can always see from a marker bit pattern if it is the first byte of a character, or a second / third / fourth byte. Just keep searching backwards until you find the beginning of the character, then go forward and decode it, and check if it is the character you are looking for.
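Here is a small, illustrative Java helper for stepping backwards to the beginning of a character. The class and method names are made up for this example:

public class Utf8Search {

    // Steps backwards from index until the byte is not a continuation
    // byte (10ZZZZZZ), i.e. until it is the first byte of a character.
    public static int findCharacterStart(byte[] utf8, int index) {
        while (index > 0 && (utf8[index] & 0b1100_0000) == 0b1000_0000) {
            index--;
        }
        return index;
    }
}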
---------------------------------------------------------------------------------
Unicode
Unicode is an encoding for textual characters which is able to represent characters from many different languages from around the world. Each character is represented by a unicode code point. A code point is an integer value that uniquely identifies the given character. Unicode characters can be encoded using different encodings, like UTF-8 or UTF-16. These encodings specify how each character's unicode code point is encoded, as one or more bytes. Each encoding represents the characters as bytes according to its own scheme.
Unicode Code Points
As mentioned earlier, each unicode character is represented by a unicode code point which is an integer value. The code point integer values go from 0 to 10FFFF (in hexadecimal encoding).
When referring to a unicode code point in writing, we write a U+ and then the hexadecimal representation of the code point. For instance, the uppercase character A is represented as U+0041. This notation is only used when referring to the code points in text, though.
On the byte encoding level the unicode characters are encoded differently. The uppercase character A does not need 6 bytes (the 6 ascii characters of the text U+0041) when encoded as bytes. Again, the exact number of bytes used depends on whether you are using the UTF-8 or UTF-16 encoding etc.
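In Java, for instance, you can see that the number of bytes differs per encoding (UTF-16BE is used here to avoid the byte order mark that StandardCharsets.UTF_16 prepends):

import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    public static void main(String[] args) {
        // The character A (U+0041) takes 1 byte in UTF-8 and 2 bytes in UTF-16.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}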
To create a text using unicode characters you use a series of unicode code points. For instance, the sequence U+0041 U+0042 U+0043 makes up the text ABC.
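In Java, for example, a String can be built directly from a sequence of code points:

public class CodePoints {

    public static void main(String[] args) {
        int[] codePoints = {0x41, 0x42, 0x43};            // U+0041 U+0042 U+0043
        String text = new String(codePoints, 0, codePoints.length);
        System.out.println(text);                         // prints ABC
        System.out.println(text.codePointAt(0));          // prints 65 (0x41)
    }
}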
Special Characters
Unicode contains some special characters which do not represent textual characters. These non-textual characters are typically located in certain intervals of the unicode value space. For instance:
Interval | Description
---|---
U+0000 - U+001F | Control characters
U+007F - U+009F | Control characters
U+D800 - U+DFFF | Surrogate pairs
U+E000 - U+F8FF | Private use area
U+F0000 - U+FFFFF | Private use area
U+100000 - U+10FFFF | Private use area
Some unicode code points are not themselves characters. Instead they are combined with the preceding unicode character to alter the character. For instance, a character with an accent over it could be represented by first the character code point followed by the accent code point. Rather than displaying this as two characters, these two code points would be combined into the first character with the accent displayed on top of it.
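For example, in Java the character é can be written either as a single code point, or as the base character e followed by the combining acute accent (U+0301):

public class Combining {

    public static void main(String[] args) {
        String composed = "\u00E9";   // é as a single code point (U+00E9)
        String combined = "e\u0301";  // e followed by COMBINING ACUTE ACCENT (U+0301)
        System.out.println(combined);           // typically rendered as é
        System.out.println(composed.length());  // 1
        System.out.println(combined.length());  // 2 - two code points, one displayed character
    }
}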
Private use areas have no characters assigned to them by the unicode standard. You can assign characters to them in your own context (should you need to), by following a standard procedure for how this is done.
Unicode Planes
Unicode code points are divided into sections which are called unicode planes. These unicode planes are indexed from 0 to 10 (in hexadecimal encoding, meaning there are 17 unicode planes in total). You can see which unicode plane a given code point belongs to by writing the code point as 6 hexadecimal digits and looking at the first 2 digits. If a code point is too small to take up 6 hexadecimal digits, add zeros in front of the number until it is 6 digits long.
As an example, the unicode code point U+0041 would become U+000041, of which the first two hexadecimal digits are 00. Thus the unicode code point U+0041 belongs to unicode plane 0.
Along the same logic, the code point U+10FFFF is already 6 hexadecimal digits long, and thus does not need any zeroes added in front of it. The first two hexadecimal digits are 10, which translates to 16 in decimal digits. Thus, the code point U+10FFFF belongs to unicode plane 16.
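Taking the first two hexadecimal digits of the 6-digit form is the same as shifting the code point 16 bits to the right, as this small Java sketch shows:

public class Planes {

    public static void main(String[] args) {
        System.out.println(0x0041 >> 16);    // 0  - plane 0
        System.out.println(0x10FFFF >> 16);  // 16 - plane 16
    }
}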
Non-character Code Points
The last 2 code points of each unicode plane (the code points ending in FFFE and FFFF) are non-characters.