It's because of UTF-16. In UTF-16, characters outside the Basic Multilingual Plane (BMP) are represented by a surrogate pair https://en.wikipedia.org/wiki/UTF-16, in which the first code unit (CU) lies in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each CU carries 10 bits of the code point, giving 20 bits of data in total (0x100000 characters), which are split into 16 planes (16×2^16 characters). The BMP accounts for the remaining 0x10000 characters (code points 0–0xFFFF).
Therefore the total number of characters is 17×2^16 = 0x100000 + 0x10000 = 0x110000, which allows for code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, which means the last code point that a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
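The surrogate-pair arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper names `encode_surrogates` and `decode_surrogates` are made up for this example and are not part of any library.

```python
HIGH_START = 0xD800  # first high (leading) surrogate code unit
LOW_START = 0xDC00   # first low (trailing) surrogate code unit
OFFSET = 0x10000     # number of BMP code points (0..0xFFFF)


def encode_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    value = cp - OFFSET  # 20-bit value in the range 0..0xFFFFF
    # The upper 10 bits go into the high surrogate, the lower 10 into the low one.
    return HIGH_START + (value >> 10), LOW_START + (value & 0x3FF)


def decode_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return OFFSET + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest pair (0xDBFF, 0xDFFF) carries 20 one-bits, i.e. 0xFFFFF,
# so the last reachable code point is 0xFFFFF + 0x10000.
max_cp = decode_surrogates(0xDBFF, 0xDFFF)
print(hex(max_cp))
```

Because both surrogate ranges hold exactly 2^10 values, no pair can encode anything beyond `max_cp`, which is why the limit falls out of UTF-16 rather than being an arbitrary choice.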
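This is enforced by the Unicode Character Encoding Stability Policies https://www.unicode.org/policies/stability_policy.html#Property_Value, under which code points above U+10FFFF will never be assigned:
The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.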
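Historically, UTF-8 allowed up to U+7FFFFFFF using 6 bytes https://en.wikipedia.org/wiki/UTF-8#History, whereas UTF-32 can store twice that number. However, because of the limit in UTF-16, the Unicode committee decided that UTF-8 can never be longer than 4 bytes, resulting in the same range as UTF-16:
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding https://www.rfc-editor.org/rfc/rfc3629#page-11: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
https://en.wikipedia.org/wiki/UTF-8#History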
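The percentages quoted above can be checked with a little arithmetic. The sketch below assumes they are counted against the non-overlong sequences of each length. That is my reading of the figures, not something the quoted text states. The helper `strict_utf8_rejects` is a hypothetical name used only here.

```python
# Code points reachable by non-overlong 3- and 4-byte sequences
# in the original (pre-RFC 3629) UTF-8 layout.
three_byte = 0x10000 - 0x800      # U+0800..U+FFFF
four_byte = 0x200000 - 0x10000    # U+10000..U+1FFFFF

surrogates = 0xDFFF - 0xD800 + 1  # 2048 code points, all 3-byte
above_max = 0x1FFFFF - 0x10FFFF   # 4-byte code points beyond U+10FFFF

surrogate_share = surrogates / three_byte
above_max_share = above_max / four_byte
print(f"{surrogate_share:.2%} of 3-byte, {above_max_share:.2%} of 4-byte")


def strict_utf8_rejects(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


last_valid = chr(0x10FFFF).encode("utf-8")  # encoding of U+10FFFF
first_invalid = b"\xf4\x90\x80\x80"         # bit pattern for U+110000
encoded_surrogate = b"\xed\xa0\x80"         # bit pattern for U+D800
```

Python's built-in decoder follows the RFC 3629 rules, so `strict_utf8_rejects` should return `True` for both `first_invalid` and `encoded_surrogate`, and `False` for `last_valid`.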
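The same restriction has been applied to UTF-32:
In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32.
https://en.wikipedia.org/wiki/UTF-32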
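As a rough illustration, the check below compares a hand-written validity test with Python's own `utf-32-le` decoder. The function names are invented for this example, and the surrogate rejection assumes Python 3.4 or later, where the UTF-32 decoder became strict about surrogates.

```python
import struct

MAX_CODE_POINT = 0x10FFFF


def is_unicode_scalar_value(value: int) -> bool:
    """True for 0..0x10FFFF, excluding the surrogate range 0xD800..0xDFFF."""
    return 0 <= value <= MAX_CODE_POINT and not 0xD800 <= value <= 0xDFFF


def utf32_decodes(value: int) -> bool:
    """True if Python accepts the 32-bit value as a single UTF-32 code unit."""
    try:
        # Pack as one little-endian 32-bit unsigned integer, then decode it.
        struct.pack("<I", value).decode("utf-32-le")
    except UnicodeDecodeError:
        return False
    return True


for v in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000, 0xFFFFFFFF):
    print(hex(v), is_unicode_scalar_value(v), utf32_decodes(v))
```

Even though a 32-bit unit has room for far larger values, the two functions should agree on every input, since the decoder accepts only the range that UTF-16 can reach.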
For more detail, you can read this answer https://www.quora.com/Why-does-Unicode-have-seventeen-planes-U-0000-to-U-10FFFF-which-sometimes-requires-a-sixth-digit-and-not-sixteen-U-0000-to-U-FFFFF and the following:
- How do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store? https://stackoverflow.com/q/130438/995714
- Does the Unicode Consortium intend to make UTF-16 run out of characters? https://stackoverflow.com/q/9384120/995714
- How many characters can be mapped with Unicode? https://stackoverflow.com/q/5924105/995714
- Proposal to limit the range of code positions to values up to U-0010FFFF http://www.unicode.org/L2/L2000/00079-n2175.htm