How do I correctly compute the length of a String in Java?

2024-04-19

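I know there is String#length and the various methods in Character, which more or less work on code units / code points.

What is the recommended way in Java to actually return the result as specified by the Unicode standards (UAX#29 http://www.unicode.org/reports/tr29/), taking into account things like language/locale, normalization and grapheme clusters?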

The normal model of Java string length

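String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.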
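As a small illustration of the "code units" model (the class and method names here are made up for the example), a character outside the Basic Multilingual Plane is stored as a surrogate pair, so it contributes 2 to length():

```java
final class CodeUnitLength {
    // Hypothetical helper: the number of UTF-16 code units,
    // which is exactly what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String ascii = "mum";
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(ascii)); // 3
        System.out.println(codeUnits(emoji)); // 2
    }
}
```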

Your description¹ of the semantics of length based on the size of the backing array/array slice is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.

Alternative models of String length

To get the number of Unicode code points in a String, use str.codePointCount(0, str.length()) -- see javadoc http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePointCount%28int,%20int%29.
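A minimal sketch of the difference between the two counts (the wrapper class and method name are just for illustration):

```java
final class CodePointLength {
    // Hypothetical helper: counts Unicode code points, so a surrogate
    // pair counts once instead of twice.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());    // 4 code units
        System.out.println(codePoints(s)); // 3 code points
    }
}
```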

To get the size (in bytes) of a String in a specific encoding (i.e. charset) use str.getBytes(charset).length².
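For example, the same String can have a different size in each charset. This is only a sketch; the class and method names are invented for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLength {
    // Hypothetical helper: encodes the whole String and reports
    // how many bytes the result occupies.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo"; // "héllo", 5 code units
        System.out.println(byteLength(s, StandardCharsets.UTF_8));      // 6
        System.out.println(byteLength(s, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(byteLength(s, StandardCharsets.UTF_16LE));   // 10
    }
}
```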

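To deal with locale-specific issues, you can use Normalizer http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html to normalize the String to whatever form is most appropriate to your use case, and then use codePointCount as above. But in some cases, even this won't work; e.g. the Hungarian letter counting rules, which the Unicode standard apparently doesn't cater for.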
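A rough sketch of normalizing first and counting afterwards (the helper name is hypothetical; which normalization form is "right" depends on your use case):

```java
import java.text.Normalizer;

final class NormalizedLength {
    // Hypothetical helper: normalizes to the requested form, then counts code points.
    static int normalizedCodePoints(String s, Normalizer.Form form) {
        String n = Normalizer.normalize(s, form);
        return n.codePointCount(0, n.length());
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        String composed = "\u00E9";    // precomposed 'é'
        System.out.println(normalizedCodePoints(decomposed, Normalizer.Form.NFC)); // 1
        System.out.println(normalizedCodePoints(composed, Normalizer.Form.NFD));   // 2
    }
}
```

The two inputs render identically, but their counts only agree once both are put into the same form.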

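Using String.length() is generally OK

The reason that most applications use String.length() is that most applications are not concerned with counting the number of characters in words, texts, etcetera in a human-centric way. For instance, if I do this: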

String s = "hi mum how are you";
int pos = s.indexOf("mum");
String textAfterMum = s.substring(pos + "mum".length());

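it really doesn't matter that "mum".length() is not returning code points, or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.

Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.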
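To make that concrete, here is a sketch of the earlier indexOf/substring example with both the text and the search term normalized up front. The class and method names are invented, and NFC is just one reasonable choice of form:

```java
import java.text.Normalizer;

final class NormalizedSearch {
    // Hypothetical helper: returns the text following the first occurrence
    // of term, or null if term does not occur. Both inputs are normalized
    // to NFC first, so plain code-unit arithmetic with length() is safe.
    static String textAfter(String text, String term) {
        String t = Normalizer.normalize(text, Normalizer.Form.NFC);
        String q = Normalizer.normalize(term, Normalizer.Form.NFC);
        int pos = t.indexOf(q);
        if (pos < 0) {
            return null;
        }
        return t.substring(pos + q.length());
    }

    public static void main(String[] args) {
        String text = "un cafe\u0301 noir"; // decomposed 'é'
        String term = "caf\u00E9";          // precomposed 'é'
        System.out.println(text.indexOf(term));    // -1: the raw forms differ
        System.out.println(textAfter(text, term)); // " noir"
    }
}
```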

1 - This description was on some versions of the question. See the edit history ... if you have sufficient rep points.

2 - Using str.getBytes(charset).length entails doing the encoding and throwing it away. There is possibly a general way to do this without that copy. It would entail wrapping the String as a CharBuffer, creating a custom ByteBuffer with no backing to act as a byte counter, and then using CharsetEncoder.encode(...) to count the bytes. Note: I have not tried this, and I would not recommend trying unless you have clear evidence that getBytes(charset) is a significant performance bottleneck.
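For what it's worth, here is an untuned sketch of a related idea. It is not the "custom ByteBuffer" approach described in footnote 2: instead it encodes through a small reusable scratch buffer and only adds up the byte counts. The class name, method name and buffer size are arbitrary choices, and whether this actually beats getBytes(charset) would need measuring:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodedLengthCounter {
    // Hypothetical helper: counts encoded bytes without materializing
    // the full byte[] that getBytes(charset) would allocate.
    static long countBytes(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                // Replace bad input, roughly as getBytes(charset) does.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer scratch = ByteBuffer.allocate(256); // reused, contents discarded
        long total = 0;
        CoderResult result;
        do {
            result = encoder.encode(in, scratch, true);
            total += scratch.position(); // bytes produced in this round
            scratch.clear();
        } while (result.isOverflow());
        do {
            result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo";
        System.out.println(countBytes(s, StandardCharsets.UTF_8));   // 6
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```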
