How can I take a substring with a limit in C? [closed]

2024-05-03

Section 1

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    static const unsigned char text[] = "000ßh123456789";
    int32_t current=1;
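    /* strlen counts bytes, not characters (15 if this file is saved as UTF-8, since ß takes two bytes) */
    int32_t text_len = (int32_t)strlen((const char *)text);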
    /////////////////////////////////
    printf("Result : %s\n",text);
    /////////////////////////////////
    printf("Length : %d\n",(int)text_len);
    /////////////////////////////////
    printf("Index0 : %c\n",text[0]);
    printf("Index1 : %c\n",text[1]);
    printf("Index2 : %c\n",text[2]);
    printf("Index3 : %c\n",text[3]);//==> why show this `�`?
    printf("Index4 : %c\n",text[4]);//==> why show this `�`?
    printf("Index5 : %c\n",text[5]);
    /////////////////////////////////
    return 0;
}

Why do text[3] and text[4] show �?

How can I index UTF-8 characters as well?

Section 2

I want to write a function like mb_substr in PHP.

(verybigstring or string) mb_substr((verybigstring or string) input, (verybigint or int) start [, (verybigint or int) $length = null])

Some examples:

mb_substr("hello world",0);

==>hello world
mb_substr("hello world",1);

==>ello world
mb_substr_two("hello world",1,3);

==>el
mb_substr("hello world",-3);

==>rld
mb_substr_two("hello world",-3,2);

==>rldhe

My question is about Section 1.

Can anyone help me? (Please)

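The Unicode character set (https://en.wikipedia.org/wiki/Unicode) currently contains more than 128,000 characters (which I will call code points from here on, to avoid confusion) and reserves room for many more. A char, which is only 8 bits wide on modern general-purpose computers, therefore cannot hold an arbitrary code point.

UTF-8 (https://en.wikipedia.org/wiki/UTF-8) is one way of encoding these code points as bytes. Here are the bytes you put into text[] (assuming UTF-8 is used to encode the code points) and what they represent: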

i:             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
text[i]:    0x30 30 30 C3 9F 68 31 32 33 34 35 36 37 38 39 00
              -- -- -- ----- -- -- -- -- -- -- -- -- -- -- --
Code Point: U+30 30 30    DF 68 31 32 33 34 35 36 37 38 39  0
Graph:         0  0  0     ß  h  1  2  3  4  5  6  7  8  9
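
If you want to check the table above against your own build, a small helper can dump the bytes of a string as hex. This is only a sketch: hex_dump is a name made up for this illustration, and it assumes the caller supplies a large enough output buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated hex.
   Returns how many bytes were dumped; stops early if out is too small. */
size_t hex_dump(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL && out_size > 0);
    size_t len = strlen(s);   /* byte count, not character count */
    size_t used = 0;
    size_t n;
    out[0] = '\0';
    for (n = 0; n < len; n++) {
        if (used + 4 > out_size)   /* room for " XX" plus the terminating NUL */
            break;
        if (n > 0)
            out[used++] = ' ';
        /* go through unsigned char so bytes >= 0x80 are not sign-extended */
        used += (size_t)snprintf(out + used, out_size - used, "%02X",
                                 (unsigned)(unsigned char)s[n]);
    }
    return n;
}
```

Called on the first six bytes of your string, with a 64-byte buffer, this should fill the buffer with "30 30 30 C3 9F 68". text[3] and text[4] are the two halves of ß; printed one at a time with %c, neither is a complete UTF-8 sequence, so the terminal shows � for each.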

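As you can see, UTF-8 is a variable-width encoding: a single code point is encoded as a variable number of bytes. This means you cannot convert "indexes into text" to "indexes into the array of bytes" without scanning the array.

A code point encoded in UTF-8 starts with one of the following bytes: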

0b0xxxxxxx    Represents an entire Code Point
0b110xxxxx    The start of a 2-byte sequence
0b1110xxxx    The start of a 3-byte sequence
0b11110xxx    The start of a 4-byte sequence

The only other form of byte you will encounter in UTF-8 is

0b10xxxxxx    A continuation byte (the 2nd, 3rd or 4th byte of sequence)
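
The bit patterns above translate fairly directly into masks. Here is a minimal sketch; the names utf8_seq_len and utf8_len_at are invented for this example, and it only checks the high bits listed above (it does not reject overlong or otherwise invalid sequences).

```c
#include <assert.h>
#include <stddef.h>

/* Classify one byte by its high bits.
   Returns 1..4 for a lead byte (the length of the sequence it starts),
   0 for a continuation byte, and -1 for a byte matching none of the patterns. */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xC0) == 0x80) return 0;   /* 10xxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return -1;                          /* 11111xxx */
}

/* Length in bytes of the sequence starting at s,
   or 0 if s does not point at a lead byte. */
int utf8_len_at(const char *s)
{
    assert(s != NULL);
    int len = utf8_seq_len((unsigned char)*s);
    return len > 0 ? len : 0;
}
```

For your string, utf8_seq_len(0xC3) gives 2 (the start of ß) and utf8_seq_len(0x9F) gives 0 (its continuation byte).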

A simple way to find the n-th code point in a string (assuming the input is valid UTF-8) is to search for the n-th char for which (ch & 0xC0) != 0x80 is true, that is, the n-th byte that is not a continuation byte. You can use the same approach to count the number of code points in a string.
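
That scan could look roughly like the sketch below. The function names (utf8_count, utf8_offset, utf8_substr) are made up for this answer, the input is assumed to be valid NUL-terminated UTF-8, and utf8_substr only handles non-negative start values, so it is a starting point for an mb_substr-like function rather than a full port of the PHP one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* True for every byte except a continuation byte (10xxxxxx). */
static int is_start(unsigned char b) { return (b & 0xC0) != 0x80; }

/* Number of code points in s. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (is_start((unsigned char)*s))
            n++;
    return n;
}

/* Byte offset of code point number n (0-based);
   returns strlen(s) if n is past the end. */
size_t utf8_offset(const char *s, size_t n)
{
    assert(s != NULL);
    size_t i = 0;
    for (; s[i] != '\0'; i++) {
        if (is_start((unsigned char)s[i])) {
            if (n == 0)
                return i;
            n--;
        }
    }
    return i;
}

/* Copy up to len code points, starting at code point start, into a
   newly allocated string. The caller frees it. NULL if malloc fails. */
char *utf8_substr(const char *s, size_t start, size_t len)
{
    size_t total = utf8_count(s);
    if (start > total)
        start = total;
    if (len > total - start)
        len = total - start;
    size_t from = utf8_offset(s, start);
    size_t to = utf8_offset(s, start + len);
    char *out = malloc(to - from + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s + from, to - from);
    out[to - from] = '\0';
    return out;
}
```

With your string, utf8_count reports 14 code points where strlen reports 15 bytes, and utf8_substr(text, 3, 2) gives back "ßh" as three bytes. A negative start in the PHP sense could be mapped to utf8_count(s) minus its absolute value before calling utf8_substr.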
