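The normal model of Java string length
String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.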
Your description¹ of the semantics of length() based on the size of the backing array/array slice is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.
Alternative models of string length
To get the number of Unicode code points in a String, use str.codePointCount(0, str.length()) (see the javadoc: http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePointCount%28int,%20int%29).
To get the size (in bytes) of a String in a specific encoding (i.e. charset), use str.getBytes(charset).length².
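To deal with locale-specific issues, you can use Normalizer (http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html) to normalize the String to whatever form is most appropriate to your use case, and then use codePointCount as above. But in some cases, even this won't work; e.g. the Hungarian letter counting rules, which the Unicode standard apparently doesn't cater for.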
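As a rough illustration of how these models can disagree, here is a minimal sketch. The class and method names (StringLengthModels, codeUnits, and so on) are made up for this example, and it assumes Java 7 or later for StandardCharsets and NFC as the normalization form; pick whatever form suits your use case.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class StringLengthModels {
    // Number of UTF-16 code units (char values): what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when encoded as UTF-8 (assumed charset for this sketch).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Code points after normalizing to NFC (assumed form for this sketch).
    static int nfcCodePoints(String s) {
        String n = Normalizer.normalize(s, Normalizer.Form.NFC);
        return n.codePointCount(0, n.length());
    }

    public static void main(String[] args) {
        String emoji = "a\uD83D\uDE00";   // 'a' followed by U+1F600 (a surrogate pair)
        String decomposed = "e\u0301";    // 'e' followed by combining acute accent U+0301

        System.out.println(codeUnits(emoji));          // 3
        System.out.println(codePoints(emoji));         // 2
        System.out.println(utf8Bytes(emoji));          // 5
        System.out.println(codePoints(decomposed));    // 2
        System.out.println(nfcCodePoints(decomposed)); // 1
    }
}
```

None of these numbers is "the" length; each answers a different question about the same String.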
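Using String.length() is generally OK
The reason that most applications use String.length() is that most applications are not concerned with counting the number of characters in words, texts, etc. in a human-centric way. For instance, if I do this: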
String s = "hi mum how are you";
int pos = s.indexOf("mum");
String textAfterMum = s.substring(pos + "mum".length());
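it really doesn't matter that "mum".length() is not returning code points, or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.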
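Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.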
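To make the "normalize first" point concrete, here is a small sketch. The class NormalizedSearch and the method textAfter are hypothetical names, and NFC is an assumed choice of form. Once both the text and the search term are in the same form, indexOf and length() line up again in code units.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class NormalizedSearch {
    // Returns the text following the first occurrence of term, or null if
    // term does not occur. Both inputs are normalized to NFC first, so the
    // offsets computed with length() refer to the normalized text.
    static String textAfter(String text, String term) {
        String t = Normalizer.normalize(text, Normalizer.Form.NFC);
        String q = Normalizer.normalize(term, Normalizer.Form.NFC);
        int pos = t.indexOf(q);
        return pos < 0 ? null : t.substring(pos + q.length());
    }

    public static void main(String[] args) {
        String text = "un cafe\u0301 noir"; // 'e' + combining acute (decomposed)
        String term = "caf\u00E9";          // precomposed e-acute

        System.out.println(text.indexOf(term));    // -1: the raw forms do not match
        System.out.println(textAfter(text, term)); // " noir"
    }
}
```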
1 - This description was on some versions of the question. See the edit history ... if you have sufficient rep points.
2 - Using str.getBytes(charset).length entails doing the encoding and throwing it away. There is possibly a general way to do this without that copy. It would entail wrapping the String as a CharBuffer, creating a custom ByteBuffer with no backing to act as a byte counter, and then using CharsetEncoder.encode(...) to count the bytes. Note: I have not tried this, and I would not recommend trying unless you have clear evidence that getBytes(charset) is a significant performance bottleneck.
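For what it is worth, here is a sketch of a variation on the idea in note 2. It is not the custom no-backing ByteBuffer described there (as far as I know, ByteBuffer cannot be subclassed outside java.nio); instead it encodes into one small scratch buffer that is reused, and adds up how many bytes each pass produced. The class name EncodedLength, the method name, and the 1024-byte buffer size are all arbitrary choices for this sketch. REPLACE is used for malformed and unmappable input on the assumption that this mirrors what getBytes(charset) does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodedLength {
    // Counts the bytes that encoding s in charset would produce, without
    // materializing the whole encoded byte array.
    static long encodedLength(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(1024); // scratch space, reused on every pass
        long total = 0;
        CoderResult result;

        // Encode until the input is exhausted; an overflow just means the
        // scratch buffer filled up, so count its contents and go again.
        do {
            result = encoder.encode(in, out, true);
            total += out.position();
            out.clear();
        } while (result.isOverflow());

        // Some encoders emit trailing bytes on flush; count those too.
        do {
            result = encoder.flush(out);
            total += out.position();
            out.clear();
        } while (result.isOverflow());

        return total;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(encodedLength(s, StandardCharsets.UTF_8)); // 5
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 5
    }
}
```

The same caveat applies: measure first, and only reach for something like this if getBytes(charset) really does show up as a bottleneck.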