
A Simple Qichacha (企查查) Scraper

Date: 2020-07-08 16:56:45


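After working with the Qichacha site, I came away strongly convinced of how important packet capture is. So much so that from now on I will build my simulated requests from captured traffic and stop relying on F12 (the browser developer tools) for analysis.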

I am writing this post in memory of the late F12~~~

import requests
from lxml import etree

# Search results page; the key parameter is the URL-encoded keyword "天津滨海新区"
url = "https://www.qcc.com/search?key=%E5%A4%A9%E6%B4%A5%E6%BB%A8%E6%B5%B7%E6%96%B0%E5%8C%BA"

# Request headers taken from a captured browser request; the cookie carries the session
hed = {
    "host": "www.qcc.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
    "upgrade-insecure-requests": "1",
    "cookie": "QCCSESSID=vpk1mpc45ci95eu83etg528881; zg_did=%7B%22did%22%3A%20%221732cdcac86bf-0039dd6baef69a-4353761-100200-1732cdcac8844f%22%7D; UM_distinctid=1732cdcb0a713b-01b058b949aa5a-4353761-100200-1732cdcb0ab44e; hasShow=1; _uab_collina=159418552807339394444789; acw_tc=7d27c71c15941953776602556e6b8442bc8001e4e1270e8fead4b79557; CNZZDATA1254842228=1092104090-1594185078-https%253A%252F%252Fwww.baidu.com%252F%7C1594195878; Hm_lvt_78f134d5a9ac3f92524914d0247e70cb=1594194111,1594195892,1594195918,1594196042; Hm_lpvt_78f134d5a9ac3f92524914d0247e70cb=1594196294; zg_de1d1a35bfa24ce29bbf2c7eb17e6c4f=%7B%22sid%22%3A%201594185526424%2C%22updated%22%3A%201594196294349%2C%22info%22%3A%201594185526455%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%5C%22%24utm_source%5C%22%3A%20%5C%22baidu1%5C%22%2C%5C%22%24utm_medium%5C%22%3A%20%5C%22cpc%5C%22%2C%5C%22%24utm_term%5C%22%3A%20%5C%22pzsy%5C%22%7D%22%2C%22referrerDomain%22%3A%20%22www.baidu.com%22%2C%22cuid%22%3A%20%22fd05f1ac2b561244aaa6b27b3bb617a4%22%7D",
}

# Fetch the page and parse the raw bytes into an lxml element tree
resq = requests.get(url=url, headers=hed).content
response = etree.HTML(resq)

# Company names: text nodes of the link in the third cell of each result row
title_list = []
title = response.xpath('//*[@id="search-result"]//tr/td[3]/a//text()')
for tit in title:
    # The quoting of replace() was lost in the source; removing commas is the apparent intent
    tit = tit.replace(",", "").strip()
    title_list.append(tit)

# Addresses: text nodes of the fourth paragraph in the same cell
addr_list = []
addrs = response.xpath('//*[@id="search-result"]//tr/td[3]/p[4]//text()')
for addr in addrs:
    addr = addr.replace(",", "").strip()
    addr_list.append(addr)

print(title_list)
print(addr_list)
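
The key parameter in the search URL above is simply the percent-encoded UTF-8 form of the search keyword (here 天津滨海新区). As a minimal sketch, assuming only the Python standard library, the URL could be built from plain text instead of being pasted in already encoded. The helper name build_search_url and the SEARCH_BASE constant are hypothetical and do not come from the original post:

```python
from urllib.parse import quote, unquote

# Base of the search URL, taken from the URL used in the post
SEARCH_BASE = "https://www.qcc.com/search?key="


def build_search_url(keyword: str) -> str:
    """Hypothetical helper: percent-encode the keyword as UTF-8 and append it."""
    return SEARCH_BASE + quote(keyword, safe="")


if __name__ == "__main__":
    url = build_search_url("天津滨海新区")
    print(url)
    # Decoding the parameter again recovers the readable keyword
    print(unquote(url.split("key=", 1)[1]))
```

Building the URL this way makes it easier to swap in a different keyword without re-encoding it by hand.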

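The code is simple, even crude. So why record this scraper at all? Because of the request headers: when I analyzed them myself and copied them over with Ctrl+C and Ctrl+V, the header data ended up inaccurate. It made me feel very strongly that analyzing requests with a packet-capture tool is faster and more effective.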
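
One way to cut down on copy-paste mistakes is to paste the whole raw header block from the capture tool and let code split it into a dictionary. The following is only a sketch under assumptions: parse_raw_headers is a hypothetical helper, the "Name: value" line layout is assumed to match what the capture tool exports, and the sample cookie value is a placeholder rather than a real session.

```python
import re

# Matches a request line such as "GET /search?key=test HTTP/1.1"
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ HTTP/\d(\.\d)?$")


def parse_raw_headers(raw: str) -> dict:
    """Hypothetical helper: turn captured 'Name: value' lines into a headers dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks, the request line, and HTTP/2 pseudo-headers like ":authority"
        if not line or REQUEST_LINE.match(line) or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line
        headers[name.strip().lower()] = value.strip()
    return headers


# Placeholder capture; the cookie below is made up for illustration
SAMPLE_CAPTURE = """
GET /search?key=test HTTP/1.1
Host: www.qcc.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Upgrade-Insecure-Requests: 1
Cookie: QCCSESSID=example; hasShow=1
"""

if __name__ == "__main__":
    for name, value in parse_raw_headers(SAMPLE_CAPTURE).items():
        print(name, "=>", value)
```

The resulting dictionary has the same shape as hed in the scraper and could be passed to requests.get through its headers argument.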

Keep going, keep working hard.


Original post: https://www.cnblogs.com/meipu/p/13267792.html
