Multibyte Characters and Wide Characters Revisited

Date: 2015-02-11 18:48:59

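People talk about multibyte characters all the time, but what does "multibyte" actually mean? It means that one character is not necessarily one char. Every English (ASCII) character fits in a single char, but a Chinese character takes 2 bytes in GBK (and typically 3 bytes in UTF-8).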

So, with GBK as the encoding:

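char s[] = "ha哈哈";     // assumes the execution character set is GBK
int l = (int)strlen(s); // 6: two ASCII bytes plus two 2-byte characters
char c = s[2];          // -71 (0xB9) when char is signed: the lead byte of '哈', not a character by itself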

s occupies 7 bytes: 6 bytes of text plus the terminating '\0'.

s[2] is only the first half of '哈', so you must never write a comparison like this:

if (s[2] == '哈') // wrong: compares a single byte with a two-byte character

So be extra careful whenever a string contains Chinese characters.
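To make the difference between bytes and characters concrete, here is a minimal sketch that assumes the text is GBK-encoded. The helpers gbk_char_count and starts_with_char_at are hypothetical names introduced for illustration only; they are not library functions.

```cpp
#include <cstddef>
#include <cstring>

// Assumption: GBK text, where a lead byte lies in 0x81..0xFE and is
// always followed by exactly one trail byte.
static bool is_gbk_lead(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// Hypothetical helper: counts characters (not bytes) in a GBK string.
std::size_t gbk_char_count(const char *s) {
    std::size_t count = 0;
    while (*s) {
        // Skip two bytes for a double-byte character, one byte otherwise.
        // A dangling lead byte at the very end is counted as one character.
        if (is_gbk_lead(static_cast<unsigned char>(*s)) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        ++count;
    }
    return count;
}

// Hypothetical helper: true if the bytes at s + pos begin with the whole
// character ch (given as a string), comparing all of its bytes at once.
// It does not check that pos falls on a character boundary.
bool starts_with_char_at(const char *s, std::size_t pos, const char *ch) {
    if (pos > std::strlen(s)) return false;
    return std::strncmp(s + pos, ch, std::strlen(ch)) == 0;
}
```

With the string above written as explicit GBK bytes, "ha\xB9\xFE\xB9\xFE", strlen reports 6 bytes while gbk_char_count reports 4 characters, and the comparison against "\xB9\xFE" at offset 2 covers both bytes of the character instead of just s[2].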

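Worse, the byte-oriented search functions in the C and C++ string libraries (strchr, strrchr, strstr, std::string::find and so on) can all return wrong results on such strings. For example: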

#include <stdio.h>
#include <locale.h>
#include <string.h>

#include <string>
using namespace std;

// Assumes this file is compiled with GBK as the execution character set.
int main()
{
    setlocale(LC_CTYPE, "chs"); // no effect: the searches below are byte-based

    // '/' is 0x2F and '\\' is 0x5C. GBK lead byte: 0x81~0xFE, trail byte: 0x40~0xFE.
    char ss22[] = "/\\hh猪哈猪头";
    ss22[7] = 92;                     // force the trail byte of the 2nd Chinese character to 0x5C
    char *p1 = strrchr(ss22, '/');    // correct: ss22 + 0
    char *p2 = strrchr(ss22, '\\');   // wrong: ss22 + 7 (a trail byte), not ss22 + 1

    string sstr22 = ss22;
    size_t pos2 = sstr22.rfind('\\'); // also wrong: 7

    printf("%p %p %p %d\n", (void *)ss22, (void *)p1, (void *)p2, (int)pos2);
    //----------

    char ss33[] = "hh哈猪头/\\";
    ss33[3] = 92;                     // force the trail byte of the 1st Chinese character to 0x5C
    char *p11 = strchr(ss33, '\\');   // wrong: ss33 + 3 (a trail byte), not ss33 + 9
    char *p12 = strstr(ss33, "\\");   // also wrong: ss33 + 3

    string sstr33 = ss33;
    size_t pos31 = sstr33.find('\\'); // also wrong: 3
    size_t pos32 = sstr33.find("\\"); // also wrong: 3
    printf("%p %p %p %d %d\n", (void *)ss33, (void *)p11, (void *)p12, (int)pos31, (int)pos32);
    return 0;
}

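When the low (trail) byte of a Chinese character in a multibyte string has the same value as an existing English (ASCII) character, a character or substring search treats the two identically. It cannot tell whether a byte is part of a Chinese character or an English character in its own right.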

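So in programs that handle Chinese text, prefer Unicode (UCS) or UTF-8, and prefer the W versions of the Windows API. In UTF-8, every byte of a multi-byte character is 0x80 or above, so it can never be mistaken for an ASCII character.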
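The reason UTF-8 avoids the problem is that every byte of a multi-byte sequence has its high bit set, so a byte below 0x80 is always a complete ASCII character. A small sketch, with hypothetical helper names:

```cpp
#include <cstddef>
#include <string>

// In UTF-8, lead and continuation bytes of multi-byte characters are all
// >= 0x80, so any byte below 0x80 is a whole ASCII character.
bool is_ascii_byte(unsigned char b) {
    return b < 0x80;
}

// Hypothetical helper: position of the last backslash in a UTF-8 string,
// or std::string::npos if there is none. A plain byte search is safe here
// because 0x5C cannot occur inside a multi-byte UTF-8 sequence.
std::size_t last_backslash_utf8(const std::string &utf8) {
    return utf8.rfind('\\');
}
```

For example, the UTF-8 bytes "/\\hh\xE7\x8C\xAA\xE5\x93\x88" (the prefix of the test string above, re-encoded) contain 0x5C only at offset 1, so the byte search returns the real backslash.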

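Recently at work I used GetModuleFileNameA to get the module directory and then strrchr(buf, '\\') to look for the backslash, and only afterwards did this multibyte problem suddenly come to mind. What a trap!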
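A safer variant of that directory lookup works on wide strings, where a backslash is always a code unit of its own. The sketch below assumes the path is already a wide string, for instance one filled in by GetModuleFileNameW; directory_of is a hypothetical helper, and the Windows call itself is left out so that the example stays portable.

```cpp
#include <string>

// Hypothetical helper: returns the directory part of a Windows path without
// the trailing backslash, or an empty string if the path has no backslash.
// In a wide string, L'\\' can never be half of another character.
std::wstring directory_of(const std::wstring &path) {
    std::wstring::size_type pos = path.rfind(L'\\');
    if (pos == std::wstring::npos) return std::wstring();
    return path.substr(0, pos);
}
```

Called with L"C:\\\u732A\u54C8\\app.exe", it returns L"C:\\\u732A\u54C8", no matter which bytes the Chinese directory name would have had in GBK.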

For multibyte strings, the mblen function reports how many bytes a single multibyte character occupies.

Multibyte handling also frequently depends on locale information, which can be set with setlocale().
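Since setlocale() returns a null pointer when the requested locale is not available, it is worth checking the result rather than assuming the call worked. A minimal sketch with hypothetical helper names; locale names are platform specific ("chs" in the code above is a Windows name).

```cpp
#include <clocale>
#include <cstdlib>

// Hypothetical helper: tries to switch LC_CTYPE to the named locale and
// reports whether the C library accepted it.
bool set_ctype_locale(const char *name) {
    return std::setlocale(LC_CTYPE, name) != nullptr;
}

// Maximum number of bytes per multibyte character in the current locale.
int max_bytes_per_char() {
    return static_cast<int>(MB_CUR_MAX);
}
```

The locale affects functions such as mblen, mbtowc and mbstowcs. It does not change strchr, strrchr or strstr, which always compare raw bytes.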

Below is an implementation of mbstowcs built on mbtowc (which converts one multibyte character to a wide character):

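// Named my_mbstowcs to avoid clashing with the standard library function.
size_t my_mbstowcs(wchar_t *pwcs, const char *pmbs, size_t n)
{
	size_t i = 0;
	mbtowc(NULL, NULL, 0); // reset the shift state
	while (*pmbs && i < n)
	{
		int len = mbtowc(&pwcs[i], pmbs, MB_CUR_MAX);
		if (len == -1)
			return (size_t)-1; // invalid multibyte sequence
		pmbs += len; // advance to the next multibyte character
		++i;
	}
	if (i < n)
		pwcs[i] = L'\0'; // terminate if there is room, as the standard function does
	return i; // wide characters written, not counting the terminator
}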

// MB_CUR_MAX: the maximum number of bytes in a multibyte character in the current locale (category LC_CTYPE).
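In everyday code the standard std::mbstowcs can simply be called directly. The sketch below wraps it in a hypothetical helper, to_wide, that returns a std::wstring; it assumes the caller has already selected a suitable LC_CTYPE locale.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string to std::wstring with
// std::mbstowcs under the current LC_CTYPE locale. On an invalid sequence
// it sets ok to false and returns an empty string.
std::wstring to_wide(const char *mbs, bool &ok) {
    // A multibyte string never yields more wide characters than it has
    // bytes, so strlen + 1 elements are enough, terminator included.
    std::vector<wchar_t> buf(std::strlen(mbs) + 1);
    std::size_t n = std::mbstowcs(buf.data(), mbs, buf.size());
    ok = (n != static_cast<std::size_t>(-1));
    if (!ok) return std::wstring();
    return std::wstring(buf.data(), n);
}
```

Once the text is in a std::wstring, searches such as rfind(L'\\') operate on whole characters rather than on individual bytes.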

This article comes from the “v” blog. Please keep this source link: http://4651077.blog.51cto.com/4641077/1613791
