Matching Chinese Characters and Punctuation with Regular Expressions

Date: 2015-07-13 13:38:27

The pattern can be written like this:

string strRegex = @"[\u4e00-\u9fa5]|[\（\）\《\》\——\；\，\。\“\”\<\>\！]";

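The first alternative matches Chinese characters (the range U+4E00 to U+9FA5 in the CJK Unified Ideographs block); the second is a character class listing the punctuation marks to match.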
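As a minimal sketch of how such a pattern might be used, the helper below keeps only the matched characters of a string. The class name ChineseTextDemo and the method KeepChinese are hypothetical and not part of the original post; the pattern is the same one written without backslashes, which these characters do not need inside a character class.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChineseTextDemo
{
    // Two alternatives: CJK ideographs in U+4E00..U+9FA5, then the listed punctuation.
    private static readonly Regex ChineseRegex =
        new Regex(@"[\u4e00-\u9fa5]|[（）《》—；，。“”<>！]");

    // Hypothetical helper: concatenates every match and drops all other characters.
    public static string KeepChinese(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        return string.Concat(
            ChineseRegex.Matches(input).Cast<Match>().Select(m => m.Value));
    }
}
```

For example, ChineseTextDemo.KeepChinese("Hello，世界！abc 123") returns "，世界！". Note that the class also lists the ASCII characters < and >, so those are kept as well.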

Separately, for processing HTML source code, HtmlAgilityPack is recommended. The code below removes script, style, and comment content from the document.

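// Requires: using System.Linq; using HtmlAgilityPack;
// Loads the HTML string and strips script, style, and comment nodes.
public static HtmlDocument InitializeHtmlDoc(string htmlString)
{
    if (string.IsNullOrEmpty(htmlString))
    {
        return null;
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlString);

    // ToList() materializes the matches first, so nodes are not removed while enumerating.
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());

    return doc;
}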

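HtmlAgilityPack supports XPath syntax, and in XPath "//comment()" means "all comment nodes". If matching on the node name "#comment" does not work, select the comments with that XPath expression instead. See http://www.cnblogs.com/rupeng/archive/2012/02/07/2342012.html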
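The XPath alternative could look roughly like the sketch below. The class name HtmlCleaner and the method RemoveComments are hypothetical; the sketch assumes the HtmlAgilityPack package is referenced and relies on SelectNodes returning null when nothing matches.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Hypothetical helper: removes every comment node selected via XPath.
    public static HtmlDocument RemoveComments(HtmlDocument doc)
    {
        if (doc == null)
        {
            return null;
        }

        // SelectNodes returns null (not an empty collection) when no node matches.
        HtmlNodeCollection comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            // Copy to a list so removal does not interfere with enumeration.
            foreach (HtmlNode comment in comments.ToList())
            {
                comment.Remove();
            }
        }

        return doc;
    }
}
```

HtmlAgilityPack has been reported to expose the doctype declaration as a comment node, so it is worth checking the output if the doctype must be preserved.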

To read (static) page content from a URL, the following code can be used:

// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string GetHtmlStr(string url)
{
    if (string.IsNullOrEmpty(url))
    {
        return string.Empty;
    }

    string html = string.Empty;
    try
    {
        WebRequest webRequest = WebRequest.Create(url);
        webRequest.Timeout = 30 * 1000; // 30 seconds, in milliseconds
        using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
        {
            if (webResponse.StatusCode == HttpStatusCode.OK)
            {
                // Use the charset declared by the server; fall back to the system default.
                string coder = webResponse.CharacterSet;
                Encoding encoding = string.IsNullOrEmpty(coder) ? Encoding.Default : Encoding.GetEncoding(coder);

                // Dispose the stream and reader once the body has been read.
                using (Stream stream = webResponse.GetResponseStream())
                using (StreamReader reader = new StreamReader(stream, encoding))
                {
                    html = reader.ReadToEnd();
                }
            }
        }
    }
    catch (Exception)
    {
        // The request may time out or fail; an empty string is returned in that case.
    }

    return html;
}
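The charset handling in the code above can be isolated into a small helper, sketched here under stated assumptions. The class name CharsetHelper and the method ResolveEncoding are hypothetical, and the UTF-8 fallback is an assumption that differs from the Encoding.Default fallback used in the original code.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: maps a charset name from the response to an Encoding.
    // Falls back to UTF-8 (an assumption) when the name is missing or not recognized.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }

        try
        {
            // Servers sometimes quote the value, for example charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // GetEncoding throws ArgumentException for names it does not know.
            return Encoding.UTF8;
        }
    }
}
```

On .NET Core and later, code-page encodings such as gb2312 are only available after registering CodePagesEncodingProvider from System.Text.Encoding.CodePages; without that, this sketch would silently fall back to UTF-8 for such pages.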

Original article: http://www.cnblogs.com/urwlcm/p/4642454.html
