我有一个由其他程序的数据填充的字符串,该数据可以使用 UTF8 编码,也可以不使用。因此,如果不是,我可以编码为 UTF8,但是在 C++ 中检测 UTF8 的最佳方法是什么?我看到了这个变体https://stackoverflow.com/questions/... https://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c但有评论称该解决方案不能提供 100% 的检测。因此,如果我对已经包含 UTF8 数据的 UTF8 字符串进行编码,那么我会将错误的文本写入数据库。
那么我可以使用这个 UTF8 检测吗:
bool is_utf8(const char * string)
{
if(!string)
return 0;
const unsigned char * bytes = (const unsigned char *)string;
while(*bytes)
{
if( (// ASCII
// use bytes[0] <= 0x7F to allow ASCII control characters
bytes[0] == 0x09 ||
bytes[0] == 0x0A ||
bytes[0] == 0x0D ||
(0x20 <= bytes[0] && bytes[0] <= 0x7E)
)
) {
bytes += 1;
continue;
}
if( (// non-overlong 2-byte
(0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
(0x80 <= bytes[1] && bytes[1] <= 0xBF)
)
) {
bytes += 2;
continue;
}
if( (// excluding overlongs
bytes[0] == 0xE0 &&
(0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
(0x80 <= bytes[2] && bytes[2] <= 0xBF)
) ||
(// straight 3-byte
((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
bytes[0] == 0xEE ||
bytes[0] == 0xEF) &&
(0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
(0x80 <= bytes[2] && bytes[2] <= 0xBF)
) ||
(// excluding surrogates
bytes[0] == 0xED &&
(0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
(0x80 <= bytes[2] && bytes[2] <= 0xBF)
)
) {
bytes += 3;
continue;
}
if( (// planes 1-3
bytes[0] == 0xF0 &&
(0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
(0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
(0x80 <= bytes[3] && bytes[3] <= 0xBF)
) ||
(// planes 4-15
(0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
(0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
(0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
(0x80 <= bytes[3] && bytes[3] <= 0xBF)
) ||
(// plane 16
bytes[0] == 0xF4 &&
(0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
(0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
(0x80 <= bytes[3] && bytes[3] <= 0xBF)
)
) {
bytes += 4;
continue;
}
return 0;
}
return 1;
}
如果检测不正确,则此代码用于编码为 UTF8:
string text;
if(!is_utf8(EscReason.c_str()))
{
int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
text.length(), 0, 0);
std::wstring utf16_str(size, '\0');
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
text.length(), &utf16_str[0], size);
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
utf16_str.length(), 0, 0, 0, 0);
std::string utf8_str(utf8_size, '\0');
WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);
text = utf8_str;
}
或者上面的代码没有正确完成?我也在 Windows 7 中执行此操作。Ubuntu 怎么样?这个变体在那里工作吗?