试图以一致的方式处理世界上所有的书写系统的小烦恼之一是,实际上你认为你所了解的关于字符的知识实际上都是正确的。这使得执行“不区分大小写的比较”之类的事情变得很棘手。事实上,进行任何形式的区域设置比较都是很棘手的,而且不区分大小写也很棘手。
不过,在一些限制下,这是可以实现的。所需的算法可以使用正常的编程实践(以及一些静态数据的预计算)来“有效”地实现,但它不能像不正确的算法那样有效地实现。通常可以牺牲正确性来换取速度,但结果并不令人愉快。不正确但快速的语言环境实现可能会吸引那些语言环境正确实现的人,但对于语言环境产生意外结果的部分受众来说显然不能令人满意。
字典顺序对人类不起作用
对于具有大小写的语言,大多数语言环境(“C”语言环境除外)已经以预期的方式处理字母大小写,即仅在考虑所有其他差异后才使用大小写差异。也就是说,如果单词列表按照区域设置的排序顺序进行排序,则列表中仅大小写不同的单词将是连续的。大写单词位于小写单词之前还是之后取决于区域设置,但中间不会有其他单词。
该结果无法通过任何单遍从左到右逐个字符的比较(“字典顺序”)来实现。而且大多数语言环境都有其他排序规则的怪癖,这些怪癖也不会屈服于天真的词典顺序。
如果您有适当的区域设置定义,标准 C++ 排序规则应该能够处理所有这些问题。但它不能仅仅使用对成对的比较函数来简化为字典顺序比较whar_t
,因此 C++ 标准库不提供该接口。
以下只是说明为什么区域设置感知排序规则如此复杂的几个示例;更长的解释,以及更多的例子,可以在Unicode 技术标准 10.
口音去哪儿了?
大多数浪漫语言(以及英语,在处理借用词时)都认为元音之上的重音是一种次要特征;也就是说,首先对单词进行排序,就像不存在重音符号一样,然后进行第二次排序,其中非重音字母位于重音字母之前。需要第三遍来处理大小写,这在前两遍中被忽略。
But that doesn't work for Northern European languages. The alphabets of Swedish, Norwegian and Danish have three extra vowels, which follow z in the alphabet. In Swedish, these vowels are written å, ä, and ö; in Norwegian and Danish, these letters are written å, æ, and ø, and in Danish å is sometimes written aa, making Aarhus the last entry in an alphabetical list of Danish cities.
In German, the letters ä, ö, and ü are generally alphabetised as with romance accents, but in German phonebooks (and sometimes other alphabetical lists), they are alphabetised as though they were written ae, oe and ue, which is the older style of writing the same phonemes. (There are many pairs of common surnames such as "Müller" and "Mueller" are pronounced the same and are often confused, so it makes sense to intercollate them. A similar convention was used for Scottish names in Canadian phonebooks when I was young; the spellings M'
, Mc
and Mac
were all clumped together since they are all phonetically identical.)
一个符号,两个字母。或者两个字母,一个符号
German also has the symbol ß which is collated as though it were written out as ss, although it is not quite identical phonetically. We'll meet this interesting symbol again a bit later.
In fact, many languages consider digraphs and even trigraphs to be single letters. The 44-letter Hungarian alphabet includes Cs, Dz, Dzs, Gy, Ly, Ny, Sz, Ty, and Zs, as well as a variety of accented vowels. However, the language most commonly referenced in articles about this phenomenon -- Spanish -- stopped treating the digraphs ch and ll as letters in 1994, presumably because it was easier to force Hispanic writers to conform to computer systems than to change the computer systems to deal with Spanish digraphs. (Wikipedia claims it was pressure from "UNESCO and other international organizations"; it took quite a while for everyone to accept the new alphabetization rules, and you still occasionally find "Chile" after "Colombia" in alphabetical lists of South American countries.)
总结:比较字符串需要多遍,有时需要比较字符组
使其全部不区分大小写
由于相比之下,区域设置可以正确处理大小写,因此实际上没有必要执行不区分大小写的排序。进行不区分大小写的等价类检查(“相等”测试)可能很有用,尽管这提出了其他哪些不精确的等价类可能有用的问题。 Unicode 规范化、重音删除、甚至转录为拉丁语在某些情况下都是合理的,但在其他情况下却非常烦人。但事实证明,大小写转换也不像您想象的那么简单。
Because of the existence of di- and trigraphs, some of which have Unicode codepoints, the Unicode standard actually recognizes three cases, not two: lower-case, upper-case and title-case. The last is what you use to upper case the first letter of a word, and it's needed, for example, for the Croatian digraph dž (U+01C6; a single character), whose uppercase is DŽ (U+01C4) and whose title case is Dž (U+01C5). The theory of "case-insensitive" comparison is that we could transform (at least conceptually) any string in such a way that all members of the equivalence class defined by "ignoring case" are transformed to the same byte sequence. Traditionally this is done by "upper-casing" the string, but it turns out that that is not always possible or even correct; the Unicode standard prefers the use of the term "case-folding", as do I.
C++ 语言环境不能完全胜任这项工作
因此,回到 C++,可悲的事实是 C++ 语言环境没有足够的信息来进行准确的大小写折叠,因为 C++ 语言环境的工作原理是假设字符串的大小写折叠仅包含顺序和单独的大写字母字符串中的每个代码点都使用将一个代码点映射到另一个代码点的函数。正如我们将看到的,这根本行不通,因此其效率问题是无关紧要的。另一方面,重症监护病房图书馆有一个接口,可以像 Unicode 数据库允许的那样正确地进行大小写折叠,并且它的实现是由一些非常优秀的编码人员精心设计的,因此它可能在限制范围内尽可能高效。所以我绝对推荐使用它。
如果您想很好地了解折叠案例的难度,您应该阅读 5.18 和 5.19 节统一码标准 (第 5 章的 PDF)。以下仅举几个例子。
大小写转换不是从单个字符到单个字符的映射
The simplest example is the German ß (U+00DF), which has no upper-case form because it never appears at the beginning of a word, and traditional German orthography didn't use all-caps. The standard upper-case transform is SS (or in some cases SZ) but that transform is not reversible; not all instances of ss are written as ß. Compare, for example, grüßen and küssen (to greet and to kiss, respectively). In v5.1, ẞ, an "upper-case ß, was added to Unicode as U+1E9E, but it is not commonly used except in all-caps street signs, where its use is legally mandated. The normal expectation of upper-casing ß would be the two letters SS
.
并非所有表意文字(可见字符)都是单字符代码
Even when a case transform maps a single character to a single character, it may not be able to express that as a wchar→wchar
mapping. For example, ǰ can easily be capitalized to J̌, but the former is a single combined glyph (U+01F0), while the second is a capital J with a combining caron (U+030C).
There is a further problem with glyphs like ǰ:
天真的逐字符大小写折叠可能会导致非规范化
Suppose we upper-case ǰ as above. How do we capitalize ǰ̠ (which, in case it doesn't render properly on your system, is the same character with an bar underneath, another IPA convention)? That combination is U+01F0,U+0320 (j with caron, combining minus sign below), so we proceed to replace U+01F0 with U+004A,U+030C and then leave the U+0320 as is: J̠̌. That's fine, but it won't compare equal to a normalized capital J with caron and minus sign below, because in the normal form the minus sign diacritic comes first: U+004A,U+0320,U+030C (J̠̌, which should look identical). So sometimes (rarely, to be honest, but sometimes) it is necessary to renormalize.
撇开 unicode 的怪异不谈,有时大小写转换是上下文相关的
Greek has a lot of examples of how marks get shuffled around depending on whether they are word-initial, word-final or word-interior -- you can read more about this in chapter 7 of the Unicode standard -- but a simple and common case is Σ, which has two lower-case versions: σ and ς. Non-greeks with some maths background are probably familiar with σ, but might not be aware that it cannot be used at the end of a word, where you must use ς.
In short
大小写折叠的最佳可用正确方法是应用 Unicode 大小写折叠算法,该算法需要为每个源字符串创建一个临时字符串。然后,您可以在两个转换后的字符串之间进行简单的字节比较,以验证原始字符串是否位于同一等价类中。对转换后的字符串进行排序规则虽然可能,但效率比对原始字符串进行排序规则要低得多,并且出于排序目的,未转换的比较可能与转换后的比较一样好或更好。
理论上,如果您只对大小写相等感兴趣,则可以线性进行转换,请记住转换不一定是上下文无关的,也不是简单的字符到字符映射函数。不幸的是,C++ 语言环境不向您提供执行此操作所需的数据。 Unicode CLDR 更接近,但它是一个复杂的数据结构。
所有这些东西都非常复杂,并且充满了边缘情况。 (请参阅 Unicode 标准中有关立陶宛语重音的注释i
例如。)您最好只使用维护良好的现有解决方案,其中最好的例子是 ICU。