Converting a string to UTF-8 to fix Chinese text recognition in TTS

Date: 2017-06-27 09:52:22


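Today I ran into a string encoding problem. I was using a TTS API that converts text to speech, and any Chinese text passed to it must be UTF-8. I had passed in Chinese text in a different encoding (the conversion function in this post treats its input as the system ANSI code page, CP_ACP), and the API never read it out correctly. Only later did I realize that the encoding was the cause. I found two functions online for this: one converts a std::string to a UTF-8 encoded std::string, and the other checks whether a buffer looks like UTF-8. Both are handy.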
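A mismatch like this is easy to miss because the text is the same and only the bytes differ. As a minimal sketch of one way to check, assuming a hypothetical helper named `to_hex` (it is not part of the original post or of any TTS API), the raw bytes of a string can be dumped and compared. For example, the character 中 is `D6 D0` in GBK but `E4 B8 AD` in UTF-8.

```cpp
#include <cstdio>
#include <string>

// Hypothetical diagnostic helper: formats every byte of s as two uppercase
// hex digits, separated by single spaces.
std::string to_hex(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast to unsigned char first so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof(buf), "%02X", static_cast<unsigned char>(s[i]));
        if (i != 0)
            out += ' ';
        out += buf;
    }
    return out;
}
```

Logging `to_hex(text)` just before the TTS call shows which encoding is actually being sent.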

The two functions are recorded below:

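#include <windows.h>
#include <string>

// Converts a string in the system ANSI code page (CP_ACP, for example GBK on
// Simplified Chinese Windows) to UTF-8. Returns an empty string on failure.
std::string string_To_UTF8(const std::string & str)
{
    if (str.empty())
        return std::string();

    const int srcLen = static_cast<int>(str.length());

    // Step 1: ANSI -> UTF-16. Query the required length first; with an explicit
    // source length the result does not include a terminating null.
    int nwLen = ::MultiByteToWideChar(CP_ACP, 0, str.c_str(), srcLen, NULL, 0);
    if (nwLen <= 0)
        return std::string();

    std::wstring wBuf(nwLen, L'\0'); // std::wstring manages its own terminator
    ::MultiByteToWideChar(CP_ACP, 0, str.c_str(), srcLen, &wBuf[0], nwLen);

    // Step 2: UTF-16 -> UTF-8, again querying the length before converting.
    int nLen = ::WideCharToMultiByte(CP_UTF8, 0, wBuf.c_str(), nwLen, NULL, 0, NULL, NULL);
    if (nLen <= 0)
        return std::string();

    std::string retStr(nLen, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, wBuf.c_str(), nwLen, &retStr[0], nLen, NULL, NULL);

    return retStr;
}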
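The second step of the conversion above, from wide characters to UTF-8, is pure bit manipulation and can be illustrated without the Windows API. The following is a minimal portable sketch of that encoding rule, not a description of how `WideCharToMultiByte` is implemented. `encode_utf8` is a hypothetical name, and it takes a full Unicode code point rather than a UTF-16 code unit. The first step (ANSI code page to wide characters) depends on code page tables and is best left to the OS.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                    // surrogates are not valid code points
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, `encode_utf8(0x4E2D)` (the character 中) produces the three bytes `E4 B8 AD`, which is why most Chinese characters occupy three bytes in UTF-8.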
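// Heuristic check: returns TRUE if the buffer is structurally valid UTF-8 and
// contains at least one multi-byte sequence. Pure ASCII input returns FALSE.
BOOL IsTextUTF8(const char* str, ULONGLONG length)
{
    DWORD nBytes = 0;      // continuation bytes still expected for the current character
    UCHAR chr;
    BOOL bAllAscii = TRUE; // stays TRUE only while every byte seen so far is ASCII
    for (ULONGLONG i = 0; i < length; ++i)
    {
        chr = static_cast<UCHAR>(*(str + i));
        // ASCII is a 7-bit code stored in one byte with the high bit clear
        // (0xxxxxxx). A set high bit means the text may be UTF-8.
        if ((chr & 0x80) != 0)
            bAllAscii = FALSE;
        if (nBytes == 0) // start of a character: work out how many bytes it occupies
        {
            if (chr >= 0x80)
            {
                if (chr >= 0xF8)      // 5- and 6-byte forms and 0xFE/0xFF are invalid under RFC 3629
                    return FALSE;
                else if (chr >= 0xF0)
                    nBytes = 4;
                else if (chr >= 0xE0)
                    nBytes = 3;
                else if (chr >= 0xC0)
                    nBytes = 2;
                else                  // a 10xxxxxx byte cannot start a character
                    return FALSE;

                nBytes--;             // the lead byte itself has been consumed
            }
        }
        else // non-leading byte of a multi-byte character, must be 10xxxxxx
        {
            if ((chr & 0xC0) != 0x80)
                return FALSE;

            nBytes--;
        }
    }
    if (nBytes > 0) // the buffer ended in the middle of a multi-byte character
        return FALSE;
    if (bAllAscii)  // all ASCII: deliberately reported as "not UTF-8"
        return FALSE;

    return TRUE;
}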
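To see how this kind of check behaves on concrete bytes, here is a minimal portable sketch of the same structural rule, written against `std::string` instead of Windows types. The name `looks_like_multibyte_utf8` is hypothetical, and the sketch assumes sequences of at most four bytes. Like the function above, it is only a heuristic: it does not reject overlong encodings or surrogates, and a short run of bytes in another encoding can occasionally pass by coincidence.

```cpp
#include <string>

// Hypothetical portable restatement of the check: true only if s contains at
// least one multi-byte sequence and every sequence has a valid lead byte
// followed by the right number of 10xxxxxx continuation bytes.
bool looks_like_multibyte_utf8(const std::string& s)
{
    int pending = 0;            // continuation bytes still expected
    bool sawMultiByte = false;  // becomes true once a lead byte is seen
    for (unsigned char c : s) {
        if (pending > 0) {
            if ((c & 0xC0) != 0x80)
                return false;   // expected 10xxxxxx
            --pending;
        } else if (c >= 0x80) {
            if ((c & 0xE0) == 0xC0)
                pending = 1;    // 110xxxxx: 2-byte sequence
            else if ((c & 0xF0) == 0xE0)
                pending = 2;    // 1110xxxx: 3-byte sequence
            else if ((c & 0xF8) == 0xF0)
                pending = 3;    // 11110xxx: 4-byte sequence
            else
                return false;   // stray continuation byte or invalid lead byte
            sawMultiByte = true;
        }
    }
    return pending == 0 && sawMultiByte;
}
```

With this rule, the UTF-8 bytes of 中文 (`E4 B8 AD E6 96 87`) pass, while the GBK bytes of the same text (`D6 D0 CE C4`) fail at the second byte because `D0` is not a continuation byte. Because pure ASCII also yields false, a caller that converts whenever the check fails will send ASCII text through the conversion as well, which is harmless since ASCII bytes are unchanged by it.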

Original article: http://www.cnblogs.com/jzdwajue/p/7083266.html
