Difference between MBCS and UTF-8 on Windows

2023-11-18

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN: "Unicode is a 16-bit character encoding."

This contradicts what I have read about Unicode. I thought Unicode can be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?


I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?

Many functions in the Windows API come in two versions: One that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType); 

int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

Each of these function pairs also has a macro without the suffix, that depends on whether the UNICODE macro is defined.

#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif

In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.

#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif

This, however, was a bad idea. You should always explicitly specify the character type.
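To make the problem concrete, here is a portable sketch of the same macro pattern. ShowTextA, ShowTextW, TCHAR_SKETCH and TEXT_SKETCH are hypothetical stand-ins invented for this illustration; they are not Windows APIs, and the real headers are more involved. The point is only that the unsuffixed name changes its parameter type depending on a build setting, while the explicit call does not.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for an "A"/"W" function pair (not real Windows APIs).
// Each one simply returns the number of code units it was given.
inline std::size_t ShowTextA(const char* text) {
    return std::char_traits<char>::length(text);
}
inline std::size_t ShowTextW(const wchar_t* text) {
    return std::char_traits<wchar_t>::length(text);
}

// Header-style switch: one unsuffixed name, two possible meanings.
#ifdef UNICODE
  #define ShowText ShowTextW
  typedef wchar_t TCHAR_SKETCH;
  #define TEXT_SKETCH(s) L##s
#else
  #define ShowText ShowTextA
  typedef char TCHAR_SKETCH;
  #define TEXT_SKETCH(s) s
#endif

// Macro-dependent call: the character type follows the build setting,
// so this code means something different in each configuration.
inline std::size_t CallGeneric() {
    const TCHAR_SKETCH* msg = TEXT_SKETCH("hello");
    return ShowText(msg);
}

// Explicit call: always the wide version, whatever UNICODE is set to.
inline std::size_t CallExplicit() {
    return ShowTextW(L"hello");
}
```

With the explicit form, a reader can tell from the call site alone which character type is in play, which is roughly the argument for writing MessageBoxW rather than MessageBox.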


What I am not getting is how UTF-8 is conceptually different from an MBCS encoding?

MBCS stands for "multi-byte character set". Taken literally, UTF-8 would seem to qualify.

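But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but traditionally NOT UTF-8. (Starting with Windows 10 version 1903, an application can opt in to UTF-8 as its active code page, but code that must run on older systems cannot rely on that.)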

To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why the fuck Windows doesn't just support UTF-8.
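To show what the UTF-8 to UTF-16 step involves, here is a hand-written, portable sketch of the conversion. It is not how the Windows API is implemented; on Windows you would normally call MultiByteToWideChar with CP_UTF8 rather than roll your own. Utf8ToUtf16 is a name made up for this example, and the decoder is deliberately simplified: it rejects malformed lead and continuation bytes but does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative UTF-8 -> UTF-16 conversion (hypothetical helper, simplified).
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        // The lead byte tells us how long the sequence is.
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            // Fits in a single 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair: two 16-bit code units.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The resulting string is what you would hand to a "W" function; going the other way (WideCharToMultiByte on Windows) reverses the same bit shuffling.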

This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.


Unicode is a 16-bit character encoding

MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)

Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.


MBCS means Multi-Byte Character Set and describes any character set in which a character may be encoded as more than one byte.

ASCII and the single-byte ANSI code pages (such as Windows-1252) are not multi-byte.

UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).

However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:

  • UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.

  • UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.

It is therefore not correct that Unicode is a 16-bit character encoding. It is rather a 21-bit character set, as its code points range from U+0000 up to U+10FFFF.
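The byte counts above can be checked with a small sketch. EncodeUtf8 and Utf16Units are hypothetical helper names chosen for this example; the bit layout follows the standard UTF-8 and UTF-16 definitions, and Utf16Units assumes it is given a valid code point.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string EncodeUtf8(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {                       // 7 bits  -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 11 bits -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 16 bits -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 21 bits -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of 16-bit code units UTF-16 needs for the same code point:
// one unit up to U+FFFF, a surrogate pair (two units) above that.
std::size_t Utf16Units(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

For instance, U+20AC (the euro sign) takes three bytes in UTF-8 but a single 16-bit unit in UTF-16, while U+1F600 takes four bytes in either encoding.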


http://hi.baidu.com/sei_zhouyu/item/a5401fce5fe9ff000bd93a63
