
Practicing web scraping with PHP's curl functions

Date: 2019-03-19 16:30:50


I have recently been practicing web scraping with curl and picked up a few things along the way.

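Some sites can only be scraped successfully when the request carries a cookie. In that case we can capture the browser's traffic with Fiddler 4, then build the request headers from the capture to get past the site's server-side checks.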
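
As a small illustration of building headers from a capture, the sketch below splits a raw request-header block (as one might copy it from a capture tool's raw view) into a header list for CURLOPT_HTTPHEADER and a cookie string for CURLOPT_COOKIE. The function name split_captured_headers and the choice of which headers to skip are assumptions made for this example, not part of the original script.

```php
<?php
/**
 * Hypothetical helper (not from the original post): turn a raw captured
 * request-header block into values usable with curl_setopt().
 * Returns ['headers' => string[], 'cookie' => string].
 */
function split_captured_headers(string $raw): array
{
    // Headers that curl manages itself; copying them verbatim can cause trouble.
    $skip    = ['host', 'content-length', 'accept-encoding'];
    $headers = [];
    $cookie  = '';

    foreach (preg_split('/\r\n|\r|\n/', trim($raw)) as $line) {
        $line = trim($line);
        // Skip blank lines and the request line, e.g. "GET /path HTTP/1.1".
        if ($line === '' || preg_match('/^[A-Z]+\s+\S+\s+HTTP\/[\d.]+$/', $line)) {
            continue;
        }
        // Skip anything that is not a "Name: value" pair.
        if (strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = array_map('trim', explode(':', $line, 2));
        $lower = strtolower($name);
        if ($lower === 'cookie') {
            $cookie = $value;              // goes to CURLOPT_COOKIE
        } elseif (!in_array($lower, $skip, true)) {
            $headers[] = "$name: $value";  // goes to CURLOPT_HTTPHEADER
        }
    }

    return ['headers' => $headers, 'cookie' => $cookie];
}
```

The result could then be passed along as curl_setopt($ch, CURLOPT_HTTPHEADER, $parts['headers']) and curl_setopt($ch, CURLOPT_COOKIE, $parts['cookie']). Accept-Encoding is skipped here on the assumption that CURLOPT_ENCODING is set separately, as in the script in this post.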

The full script is as follows:

 

<?php
$stime = microtime(true); // record the time at which the script starts

$ch = curl_init();
// URL-encode the search keywords before appending them to the query string
$ckw = urlencode("圆形折叠麻将机全自动餐桌两用带椅子机麻家用欧式实木电动麻将桌PHP");
curl_setopt($ch, CURLOPT_URL, "http://www.meegoe.com/sou_Y3sl/?keywords=" . $ckw);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // HTTPS requests: do not verify the certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false); // HTTPS requests: do not verify the host name
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // 10-second timeout
curl_setopt($ch, CURLOPT_HEADER, 0); // do not include the response's HTTP headers
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the result as a string (0 would print it straight to the page)
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate'); // send Accept-Encoding and decode gzip/deflate responses
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Connection: keep-alive',
    'Upgrade-Insecure-Requests: 1',
    'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer: http://www.meegoe.com/',
    // No explicit Accept-Encoding header here: CURLOPT_ENCODING above already sends one,
    // and advertising "br" could yield a Brotli body that this curl build may not decode.
    'Accept-Language: zh-CN,zh;q=0.9',
));
curl_setopt($ch, CURLOPT_COOKIE, 'cna=3kDtFId2QkgCAbaBNczxpp/X; hng=CN%7Czh-CN%7CCNY%7C156; UM_distinctid=1694cacf99e377-0508f14d59d804-b781636-100200-1694cacf99f503; lid=haosweet; ali_ab=139.207.69.183.1552121532339.7; ali_apache_id=11.23.78.17.1552121946782.441191.7; h_keys="%u714e%u997c#%u4ebf%u5065%u8702%u80f6#arrl#%u8702%u871c"; ad_prefer="2019/03/10 16:12:09"; alicnweb=touch_tb_at%3D1552399125776%7Clastlogonid%3Dlweizwer1%7Cshow_inter_tips%3Dfalse%7ChomeIdttS%3D00964423784623518415084922041883626898%7ChomeIdttSAction%3Dtrue; l=bBEVjusVvczbe9OxBOCiVQhfhO_t7IRxMuSJcRVMi_5ZNsY1VUQOl_0dEUv6Vj5RsLYB4z6vzNp9-etlw; XSRF-TOKEN=2f40c7b9-d618-4a87-929c-6ec68e187b2c; CNZZDATA1261998348=973403025-1552118420-%7C1552486140; _m_h5_tk=4727e399320af3a5455fcd4219c72fab_1552498204613; _m_h5_tk_enc=12f885140e5402698a7e5018ae67b361; cookie2=58efdb753105b686ecef55419f454629; t=dbb53d6a3987478e0c4062af14864a21; _tb_token_=5376f111de15e; __wapcsf__=1; CNZZDATA1000231236=1647319684-1552120971-https%253A%252F%252Fm.1688.com%252F%7C1552488361; cookie1=AC0hYCJXlFdQ3RN7W3VKzBxzzaCSgNA0AcLxtMnL0LY%3D; cookie17=UUwZ%2FIsa2fE%3D; sg=t49; csg=1d0e7cb7; unb=24919014; cn_tmp=Z28mC+GqtZ3c/avrs/YMFPZ2bd/61p+3miHRxh1Ln3dRZyI6Ty99coP9r4fXQg5HeFCbmBGfCoM0uJ5S4hKyp8d/dCaXlrTBFg1H7FJYAmCE010I4N7w+Jg7Vhr/kazJ; cn_m_s=f45iaMZwweUupgX4WQ4KM4ETJYv3kX2cqWhNxyCgwpVdxe/jMOVBC3dCOcy+zuVVQ8hOf3FxKpA=; ali_apache_track="c_mid=b2b-24919014|c_lid=haosweet"; tbsnid=L/wTnzRvkMphMdhmdLmpLeGySwk1OIyE25mgNN62TSI6sOlEpJKl9g==; __cn_logon__=true; __cn_logon_id__=haosweet; ali-ss=bG9naW5JZD1oYW9zd2VldCZsb2dpbk1lc3NhZ2VFcnJvcj0mbG9naW5TdGF0dXNSZXRNc2c9JnVzZXJJZD0yNDkxOTAxNCZsb2dpbkVycm9yVXNlck5hbWU9JmNoZWNrY29kZT0mbWVtYmVySWQ9YjJiLTI0OTE5MDE0JnNpZD01OGVmZGI3NTMxMDViNjg2ZWNlZjU1NDE5ZjQ1NDYyOSZlY29kZT1pZVNtSw==; isg=BGxsuluhHLbc1gh-fAzC2pEuPUpejRHMGizBBMatWZe60Q3b7jeYXvJz9dlM2Ugn');
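$output = curl_exec($ch);
if ($output === false) {
    // The request failed: report curl's own error message and stop
    $error = curl_error($ch);
    curl_close($ch);
    exit("curl error: " . $error);
}
curl_close($ch);

// Strip all whitespace so the pattern below can match across line breaks
$contents = preg_replace("/[\r\n\t ]+/", '', $output);
$contents = mb_convert_encoding($contents, 'gbk', 'UTF-8'); // convert from UTF-8 to GBK
print_r($contents);
// With whitespace removed, '<div class=' reads '<divclass=' and '<a href=' reads '<ahref='
preg_match_all("/(<divclass=\"container-tag\"><ahref=\"(.*)\">)(.*)(<\/a><\/div>)/iUs", $contents, $matchrelated);

print_r($matchrelated[3]); // the link text of each match

$etime = microtime(true); // record the time at which the script finishes
$total = $etime - $stime; // compute the difference

echo "<br />Page execution time: {$total} seconds";
?>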
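
The script above strips all whitespace and then matches tags with a regular expression. As an alternative sketch, the same links could be pulled out with PHP's bundled DOM extension, which leaves the HTML untouched. This is not the original author's approach; the function name extract_container_tags is hypothetical, and the XPath expression assumes the class attribute is exactly "container-tag", as in the pattern above.

```php
<?php
/**
 * Hypothetical alternative to the regex: collect href and link text from
 * every <a> that is a direct child of <div class="container-tag">.
 * Assumes the input HTML is UTF-8 encoded.
 */
function extract_container_tags(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common workaround to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//div[@class="container-tag"]/a') as $a) {
        $result[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $result;
}
```

With this approach, $output from curl_exec() would be passed in directly, before any whitespace stripping or conversion to GBK.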


Original article: https://www.cnblogs.com/95327gouw/p/10559208.html
