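I am learning about UTF-16 encoding, and I have read that to represent code points in the range U+10000 to U+10FFFF you must use surrogate pairs, which lie in the range U+D800 to U+DFFF.
Suppose I want to encode the following code point: U+10123 (binary 10000000100100011):
First I lay out this bit sequence: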
110110xxxxxxxxxx 110111xxxxxxxxxx
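Then I fill the x positions with the 20-bit value left after subtracting 0x10000 from the code point (0x10123 - 0x10000 = 0x00123, binary 00000000000100100011):
1101100000000000 1101110100100011 (hex D800 DD23)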
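The steps above can be sketched in Python as follows. This is only a minimal illustration: the function name `encode_surrogate_pair` is made up for this example, and the last line cross-checks the result against Python's built-in `utf-16-be` codec. Note that the offset 0x10000 is subtracted before the bits are split, which is why U+10123 comes out as D800 DD23.

```python
def encode_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates.

    Hypothetical helper written for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 | (offset >> 10)          # 110110 + top 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # 110111 + bottom 10 bits
    return high, low


high, low = encode_surrogate_pair(0x10123)
print(f"{high:04X} {low:04X}")              # D800 DD23

# Cross-check against the built-in big-endian UTF-16 codec.
assert chr(0x10123).encode("utf-16-be") == bytes.fromhex("D800DD23")
```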
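I have also read that code points in the range U+D800 to U+DFFF were removed from the Unicode character set, but I don't understand why this range was removed!
I mean, this range could easily be encoded in 4 bytes. For example, here is how I would expect the code point U+D812 (binary 1101100000010010) to look in UTF-16 if its bits, padded with leading zeros, were placed directly into the x positions:
1101100000110110 1101110000010010 (hex D836 DC12)
Note: I am using UTF-16 Big Endian in my examples.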
Code points U+D800 - U+DFFF are reserved exclusively [1] for use with UTF-16. Since they are not in the range U+10000 - U+10FFFF, UTF-16 would not encode them individually using surrogate pairs, so it would be ambiguous (and illegal [2]) for these individual code points to appear un-encoded in a UTF-16 sequence.
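To make this concrete, here is a small Python check, offered as a sketch rather than a definitive test. It relies on the behaviour of CPython's built-in `utf-16-be` codec, and the helper name `can_encode_utf16` is made up for this example. It shows that the pair D836 DC12 is already taken (a decoder adds 0x10000 back and reads it as U+1D812, not U+D812), and that a lone surrogate code point is rejected by the encoder.

```python
# The pair D836 DC12 decodes to a supplementary code point, because the
# decoder adds 0x10000 back to the 20-bit payload.
pair_bytes = bytes.fromhex("D836DC12")
decoded = pair_bytes.decode("utf-16-be")
decoded_code_point = ord(decoded)
print(f"U+{decoded_code_point:04X}")        # U+1D812


def can_encode_utf16(text):
    """Return True if `text` can be encoded as UTF-16 (hypothetical helper)."""
    try:
        text.encode("utf-16-be")
        return True
    except UnicodeEncodeError:
        return False


# A lone surrogate code point has no UTF-16 encoding at all.
lone_surrogate_ok = can_encode_utf16("\ud812")
print(lone_surrogate_ok)                    # False
```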
Per the Unicode.org UTF-16 FAQ (http://www.unicode.org/faq/utf_bom.html):
1: Q: What are surrogates? http://www.unicode.org/faq/utf_bom.html#utf16-1
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
2: Q: Are there any 16-bit values that are invalid? http://www.unicode.org/faq/utf_bom.html#utf16-7
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
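The rule quoted above can be sketched as a scan over a sequence of 16-bit code units. This is only an illustration under my own assumptions: the function name `find_unpaired_surrogates` and the choice to return the offending indices are not from the FAQ.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogates in a list of 16-bit code units.

    Hypothetical helper: a high surrogate must be immediately followed by a
    low surrogate, and a low surrogate must be immediately preceded by a high one.
    """
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                      # consume the whole pair
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


print(find_unpaired_surrogates([0xD800, 0xDD23]))          # []
print(find_unpaired_surrogates([0x0041, 0xDC12, 0xD836]))  # [1, 2]
```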