Web crawler: the process of writing a program that simulates a browser surfing the web, then letting it crawl/fetch data from the Internet.
First, let's look at how to use the requests module:
requests module: a module for sending network requests
Environment setup: pip install requests
Purpose of the requests module: to simulate a browser sending requests
The requests coding workflow:
1. Specify the url
2. Send the request
3. Get the response data
4. Persist the data
# Crawl the page data of the Sogou homepage
import requests
# Specify the url
url = 'https://www.sogou.com/'
# Send the request: the return value of get() is a response object
response = requests.get(url=url)
# Get the response data: the text attribute returns the response body as a string
page_text = response.text
print(page_text)
# Persistent storage
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
Next, a few exercises to get familiar with the requests module.
Requirement: crawl the page data returned by a Sogou search for a given term, using a requests GET request with query parameters.
import requests
# 1. Specify the url
url = 'https://www.sogou.com/web'
# Build the query parameters for the url
wd = input('enter a word:')
param = {'query': wd}
# 2. Send the request: params carries the query-string parameters
response = requests.get(url=url, params=param)
# Set the encoding of the response data
response.encoding = 'utf-8'
# 3. Get the response data: the text attribute returns it as a string
page_text = response.text
print(page_text)
# 4. Persistent storage
name = wd + '.html'
with open(name, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(name, 'scraped successfully!')
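For reference, the params dictionary above is URL-encoded into the query string for you. The equivalent encoding with the standard library looks like this (a minimal offline sketch; the search word is just an example):

```python
from urllib.parse import urlencode

# requests.get(url, params=param) appends the encoded dict as a query string
param = {'query': 'python'}
full_url = 'https://www.sogou.com/web?' + urlencode(param)
print(full_url)  # https://www.sogou.com/web?query=python
```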
This site also has a UA-based anti-crawling mechanism (UA detection); the workaround is UA spoofing, i.e. sending a browser-like User-Agent header.
import requests
# 1. Specify the url
url = 'https://www.sogou.com/web'
# Build the query parameters for the url
wd = input('enter a word:')
param = {'query': wd}
# UA spoofing: pretend to be a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# 2. Send the request with the spoofed UA applied
response = requests.get(url=url, params=param, headers=headers)
# Set the encoding of the response data
response.encoding = 'utf-8'
# 3. Get the response data
page_text = response.text
print(page_text)
# 4. Persistent storage
name = wd + '.html'
with open(name, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(name, 'scraped successfully!')
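The same UA-spoofing idea carries over to the standard library's urllib as well. A minimal offline sketch (the request object is only constructed here, never actually sent; note that urllib normalizes header names to capitalized form):

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# Build the request object with the spoofed UA attached (not sent yet)
req = urllib.request.Request('https://www.sogou.com/web?query=test', headers=headers)
# urllib stores header keys capitalized, hence 'User-agent'
print(req.get_header('User-agent'))
```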
Requirement: crawl the cosmetics production licence data from the National Medical Products Administration site at http://125.35.6.84:81/xk/
Approach: the licence list on the page is loaded dynamically via Ajax POST requests, so requesting the page URL directly won't return the data. First POST to the list endpoint page by page and parse each company's ID out of the JSON response, then POST each ID to the detail endpoint to fetch the full record.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
IDs = []
all_data = []  # stores the detail data of every company
for page in range(1, 5):
    data = {
        "on": "true",
        "page": str(page),
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": "",
    }
    # The homepage list is returned by an Ajax request; parse the IDs out of its JSON response
    json_obj = requests.post(url=post_url, headers=headers, data=data).json()
    for dic in json_obj['list']:
        ID = dic['ID']
        IDs.append(ID)
for id in IDs:
    detail_post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    data = {'id': id}
    detail_dic = requests.post(url=detail_post_url, headers=headers, data=data).json()
    all_data.append(detail_dic)
print(all_data[0])
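To persist the collected details instead of just printing the first one, the all_data list can be written out as a JSON file, matching the persistence step from the workflow above (a sketch; the field names in the sample record are hypothetical):

```python
import json

# Hypothetical sample standing in for the detail records gathered in all_data
all_data = [{'epsName': 'Example Cosmetics Co.', 'certStr': 'XK-0001'}]
# ensure_ascii=False keeps non-ASCII (e.g. Chinese) characters readable in the file
with open('./all_data.json', 'w', encoding='utf-8') as fp:
    json.dump(all_data, fp, ensure_ascii=False, indent=2)
```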
Original source: https://www.cnblogs.com/lilei1996/p/10916446.html