PHP curl扩展实现数据抓取

时间：2016-07-20 11:50:37 阅读：282 评论：0 收藏：0 [点我收藏+]

标签：

PHP版本：5.5.30

服务器：apche

抓取网站地址：http://nc.mofcom.gov.cn/channel/gxdj/jghq/jg_list.shtml

抓取目标：获取当日的价格数据

一、准备工作：

1.打开php.ini配置文件，开启curl功能扩展

　　extension=php_curl.dll

　　如果开启之后（记得重启apache服务），仍然不能使用curl的扩展方法，请检查环境变量是否设置。

二、分析所要抓取的页面，数据结构

　　查看http://nc.mofcom.gov.cn/channel/gxdj/jghq/jg_list.shtml源码。

　　看到所有数据都在<tbody>和</tbody>标签之中。

三、思路

　　1.获取当前网页数据，截取到<tbody>和</tbody>。（其中数据为多行组成：<tr><td></td></tr>）

　　2.在<tbody>和</tbody>中，逐个读取行，和列的值即可获得数据。

　　3.当日的数据，可能在第一页，第二页等多页。因此，进行翻页读取，直到页面中找不到当日的数据，停止程序运行！

　　总共就是三个循环：页循环-->行循环-->列循环

四、代码实现

//定义要抓取的页面地址

$url=‘http://nc.mofcom.gov.cn/channel/gxdj/jghq/jg_list.shtml‘;//?page=1

//初始化curl连接
$curl=curl_init();
// 设置header
curl_setopt($curl, CURLOPT_HEADER, 0);
// 设置cURL 参数，要求结果保存到字符串中还是输出到屏幕上。
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

$flag="start"; //页循环执行标志，start:表示继续执行。 end：标志终止执行（本页面找不到包含当天日期的数据）。

//开始页循环
$i=1;      //$i作为page的值，控制翻页
$myResultSet=[];//定义数组，用于存放所有行的数据
while ($i>0)//循环停止条件为：本页面找不到包含当天日期的数据
{
    if($flag=="end"){
        //停止页循环
        break;
    }
    //设置所要抓取的网页地址，及页数
    curl_setopt($curl, CURLOPT_URL,$url."?page=".$i);

    // 运行cURL，请求网页
    $data = curl_exec($curl);

    //通过substr和strpos函数，获取<tbody>和</tbody>之间的数据
    $mycontent=substr($data, strpos($data, "<tbody>"),strpos($data, "</tbody>")-strpos($data, "<tbody>"));

    //<table>行循环读取（<tr>）
    while(strpos($mycontent, "<tr>")!==false){//如果行存在strpos($mycontent, "<tr>")
         $myResultTr=[];                      //定义局部行数组。用于存放一行的数据，每次行循环重新定义，以清空数组中上一行的数据
         //获取行
         $mytr=substr($mycontent, strpos($mycontent, "<tr>"),strpos($mycontent, "</tr>")-strpos($mycontent, "<tr>"));
         //如果不是当前日期的数据、设置程序终止标志、退出循环
         if(strpos($mytr, date("Y-m-d"))===false){             //本页面找不到包含当天日期的数据,
             $flag="end";                                //设置标志为end。
             break;
         }
         //列循环获取前四列
         $j=0;      //列标识
         //&&strpos($mytr, date("Y-m-d"))!==false
         while(strpos($mytr, "<td>")!==false){
             if($j>=4){   //第4列后的数据是不需要的。
                 break;
             }
             //获取列
             $mytd=substr($mytr, strpos($mytr, "<td>"),strpos($mytr, "</td>")-strpos($mytr, "<td>"));

             $pre=array(" ","　","\t","\n","\r");
             //str_replace将值中包含的空格、制表符、换行符、回车去除。 strip_tags去除html标签
             //echo str_replace($pre, ‘‘, strip_tags($mytd));

             //将列数据，压入行结果中
             array_push($myResultTr, str_replace($pre, ‘‘, strip_tags($mytd)));

             $mytr=substr($mytr, strpos($mytr, "</td>")+5);
             $j++;
         }

         //将行结果，压入到总结果集中
         array_push($myResultSet, $myResultTr);

         //改变$mycontent （截取掉第一行数据，保留剩下数据
         $mycontent=substr($mycontent, strpos($mycontent, "</tr>")+5);
     }

    $i++;
}
var_dump($myResultSet);   //打印出抓取到的所有数据
echo "application stop in page".--$i;
// 关闭URL请求
curl_close($curl);

五、测试结果：

　　抓取到127条数据，网页执行到第10页的时候停止下来。

　　技术分享

PHP curl扩展实现数据抓取

标签：

原文地址：http://www.cnblogs.com/ahguSH/p/5687653.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行