Pitfalls when handling GBK-encoded pages in a scraper

Posted: 2019-08-20 12:28:38

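Most projects today use UTF-8, but a few websites, Sogou for example, still serve GBK-encoded pages. If a scraped GBK page is processed without conversion, its HTML content cannot be parsed.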

Solution

1. Convert the page content to UTF-8.
2. Replace the GBK charset declaration in the page head with UTF-8. This step is important.

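    /**
     * Content handling:
     * convert the GBK bytes to UTF-8, then rewrite the charset declared in the
     * page head from gbk/gb2312 to utf-8. Once the body has been converted, the
     * declared charset must be changed as well, otherwise the page will still be
     * decoded as GBK and the text will come out garbled.
     */
    $content = iconv('GBK', 'UTF-8//IGNORE', $content);
    // Note: this replaces every occurrence of "gbk" or "gb2312" in the page,
    // not only the one inside the charset declaration.
    $content = preg_replace("/gb(k|2312)/i", "utf-8", $content);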
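The two steps above can be combined into one helper. This is only a minimal sketch: the function name `gbkPageToUtf8` and the sample page are hypothetical, and unlike the snippet above it rewrites only `charset=` declarations rather than every occurrence of "gbk".

```php
<?php
// Hypothetical helper (not from the original post): convert a GBK page to
// UTF-8 and rewrite its charset declaration to match.
function gbkPageToUtf8(string $content): string
{
    // Step 1: transcode the bytes. //IGNORE asks iconv to discard
    // characters it cannot convert instead of aborting.
    $converted = iconv('GBK', 'UTF-8//IGNORE', $content);
    if ($converted === false) {
        return $content; // conversion failed, leave the input untouched
    }
    // Step 2: rewrite the declared charset so that the converted bytes
    // are not decoded as GBK again later.
    return preg_replace(
        '/(charset\s*=\s*["\']?)gb(?:k|2312)/i',
        '${1}utf-8',
        $converted
    );
}

// Sample input: a tiny page encoded as GBK for demonstration purposes.
$gbkPage  = iconv('UTF-8', 'GBK', '<meta charset="gbk"><title>搜狗搜索</title>');
$utf8Page = gbkPageToUtf8($gbkPage);
```

Restricting the replacement to `charset=` is a design choice of this sketch; it avoids touching URLs or body text that happen to contain "gbk".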

The approach I used previously for string encoding problems

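    function doEncoding($str)
    {
        // Candidates are tried in order; strict mode rejects an encoding
        // when the string contains bytes that are invalid in it.
        $encode = mb_detect_encoding($str, ['ASCII', 'UTF-8', 'GB2312', 'GBK', 'BIG5'], true);
        // Convert only when detection succeeded and the text is not already
        // UTF-8 (ASCII is a subset of UTF-8 and needs no conversion).
        if ($encode !== false && !in_array(strtoupper($encode), ['UTF-8', 'ASCII'], true)) {
            $str = mb_convert_encoding($str, 'UTF-8', $encode);
        }
        return $str;
    }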
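Detection among similar multibyte encodings is heuristic, so when the only non-UTF-8 sources are known to be GBK, a simpler check can be used instead. The sketch below is an alternative technique, not the author's method: the name `toUtf8AssumingGbk` is hypothetical, and it assumes that any input that is not valid UTF-8 is GBK.

```php
<?php
// Hypothetical alternative: keep valid UTF-8 as-is, otherwise assume GBK.
function toUtf8AssumingGbk(string $str): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Assumption: everything else is GBK.
    return mb_convert_encoding($str, 'UTF-8', 'GBK');
}

// Build a GBK sample from a UTF-8 literal, then convert it back.
$gbk      = mb_convert_encoding('搜狗', 'GBK', 'UTF-8');
$fromGbk  = toUtf8AssumingGbk($gbk);
$fromUtf8 = toUtf8AssumingGbk('搜狗');
```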

Detecting GBK from the Content-Type of the curl response

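    /**
     * Convert GBK/GB2312 HTML to UTF-8.
     * @param $html      response body
     * @param $curl_info transfer info for the request (e.g. from curl_getinfo())
     * @return mixed|string
     */
    private function do_html_to_utf8($html, $curl_info)
    {
        // Content-Type often reads "charset=GBK", so match case-insensitively.
        if (!empty($curl_info['content_type']) && preg_match("/gb(k|2312)/i", $curl_info['content_type'], $match) > 0) {
            $encode = $match[0];
            $html = iconv($encode, "utf-8//IGNORE", $html);
            // Also rewrite the charset declared inside the page itself.
            $html = preg_replace("/gb(k|2312)/i", "utf-8", $html);
        }
        return $html;
    }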
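To illustrate the Content-Type check in isolation, here is a minimal sketch. The helper name `charsetFromContentType` is hypothetical, and `$curlInfo` is a hand-written stand-in for the array that `curl_getinfo($ch)` would return after a real request.

```php
<?php
// Hypothetical helper: extract the charset label from a Content-Type value.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null; // no charset parameter present
}

// Stand-in for curl_getinfo($ch); no network request is made here.
$curlInfo = ['content_type' => 'text/html; charset=GBK'];
$charset  = charsetFromContentType($curlInfo['content_type']);

// Stand-in for a GBK-encoded response body.
$html = iconv('UTF-8', 'GBK', '<title>搜狗</title>');
if (in_array($charset, ['gbk', 'gb2312'], true)) {
    // GBK is a superset of GB2312, so decoding as GBK covers both labels.
    $html = iconv('GBK', 'UTF-8//IGNORE', $html);
}
```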

Regex for stripping comments from an HTML page

    $content = preg_replace('#<!--[^\!\[]*?(?<!\/\/)-->#', '', $content);
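A short demonstration of what this pattern removes and what it leaves alone, using made-up sample strings. Ordinary comments are stripped, while conditional comments (which start with `[`) and script guards ending in `//-->` are kept. Because the character class excludes `!` and `[`, an ordinary comment containing either character is also left in place.

```php
<?php
$pattern = '#<!--[^\!\[]*?(?<!\/\/)-->#';

// Ordinary comment: removed.
$plain = preg_replace($pattern, '', '<p>a</p><!-- note --><p>b</p>');

// Conditional comment: "[" right after "<!--" prevents a match, so it stays.
$conditional = preg_replace($pattern, '', '<!--[if IE]><b>x</b><![endif]-->');

// Script guard: the lookbehind rejects a "-->" preceded by "//", so it stays.
$guarded = preg_replace($pattern, '', "<script><!--\nvar a = 1;\n//--></script>");
```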


Original article: https://www.cnblogs.com/zqsb/p/11382115.html
