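Web crawlers
A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.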
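HTTP methods for interacting with a server:
get: retrieves a resource only; it does not add or modify data
post: submits data to a resource on the server, typically through an HTML form
put: creates a resource at the given URL, or replaces it if it already exists
delete: deletes a resource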
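The four methods above can be sketched with Requests by preparing requests without sending them, which makes the method and body of each one visible. The httpbin.org/anything URL and the sample form fields are assumptions chosen for illustration; nothing here goes over the network.

```python
import requests

# Assumed endpoint, used only to build requests; nothing is actually sent.
URL = "http://httpbin.org/anything"


def build(method, data=None):
    """Prepare (but do not send) a request so its method and body can be inspected."""
    return requests.Request(method, URL, data=data).prepare()


prepared = [
    build("GET"),                           # read only, no body
    build("POST", data={"name": "alice"}),  # submit form data
    build("PUT", data={"name": "bob"}),     # create or replace a resource
    build("DELETE"),                        # delete a resource
]

for p in prepared:
    print(p.method, p.url, p.body)
```

To actually send one of them, pass it to a session, for example requests.Session().send(prepared[0]).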
The Requests module
Install: pip install requests
import requests
1. GET requests
import requests

# Values passed via params are appended to the URL as a query string
params = {'key1': 'hello', 'key2': 'world'}
url = 'http://www.baidu.com'
r = requests.get(url=url, params=params)
print(r.url)
Output:
http://www.baidu.com/?key1=hello&key2=world
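To see where that query string comes from, here is a minimal sketch that reproduces the encoding with the standard library's urlencode, which Requests relies on internally for this step. The second dictionary (q and lang) is made up to show what happens to spaces and non-ASCII characters.

```python
from urllib.parse import urlencode

params = {"key1": "hello", "key2": "world"}
base_url = "http://www.baidu.com/"

# Join key=value pairs with "&" and append them after "?"
query = urlencode(params)
full_url = base_url + "?" + query
print(full_url)

# Spaces become "+" and non-ASCII characters are percent-encoded as UTF-8
encoded = urlencode({"q": "web crawler", "lang": "中文"})
print(encoded)
```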
2. POST requests
import requests

# Values passed via data are sent form-encoded in the request body
params = {'key1': 'hello', 'key2': 'world'}
r = requests.post("http://httpbin.org/post", data=params)
print(r.text)
Output:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "hello",
    "key2": "world"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "21",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "json": null,
  "origin": "113.116.146.147",
  "url": "http://httpbin.org/post"
}
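The Content-Type and Content-Length headers in that response can be checked locally. The sketch below prepares the same POST without sending it, and for comparison prepares a second one with the json= argument, which the original example does not use. It is an illustration of how Requests builds the body, not a replacement for the example above.

```python
import json
import requests

params = {"key1": "hello", "key2": "world"}
url = "http://httpbin.org/post"

# data= produces a form-encoded body, like an HTML form submission
form_req = requests.Request("POST", url, data=params).prepare()
# json= serializes the dictionary as JSON instead (shown for comparison)
json_req = requests.Request("POST", url, json=params).prepare()

print(form_req.headers["Content-Type"])
print(form_req.body)
print(form_req.headers["Content-Length"])

print(json_req.headers["Content-Type"])
print(json.loads(json_req.body))
```

With data=, httpbin echoes the values under "form"; with json=, they would appear under "json" instead.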
3. The HTTP response
import requests

url = "http://qiushibaike.com/"
r = requests.get(url=url)
print(r.encoding)       # encoding used to decode r.text
print(type(r.text))     # decoded body
print(type(r.content))  # raw body
Output:
UTF-8
<class 'str'>
<class 'bytes'>
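What is the difference between text and content in Requests?
r.text returns a str: the response body decoded using r.encoding. Use it for textual data such as HTML.
r.content returns bytes: the raw response body. Use it for images and other files.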
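The relationship between the two can be sketched without a network call. Below, raw_body stands in for r.content and text_body for r.text; the HTML snippet and the temporary file name are made up for illustration.

```python
import os
import tempfile

# Stand-in for r.content: the raw bytes a server might send for a UTF-8 page
raw_body = "<title>糗事百科</title>".encode("utf-8")

# Roughly what r.text does: decode those bytes with the response encoding
encoding = "utf-8"
text_body = raw_body.decode(encoding)
print(type(raw_body), type(text_body))

# Decoding with the wrong encoding yields garbled text, which is why
# r.encoding matters when reading r.text
garbled = raw_body.decode("latin-1")
print(garbled == text_body)

# Binary data (images, archives) should be written from bytes in "wb" mode
path = os.path.join(tempfile.mkdtemp(), "page.bin")
with open(path, "wb") as f:
    f.write(raw_body)
```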
4. Other commonly used attributes
import requests

# Send a browser-like User-Agent; some sites may reject the default python-requests one
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}
r = requests.get('https://www.qiushibaike.com/', headers=header)
# print(r.text)
print(r.request)      # the request that was sent
print(r.headers)      # response headers
print(r.cookies)      # cookies set by the server
print(r.url)          # final URL
print(r.status_code)  # HTTP status code
Output:
<PreparedRequest [GET]>
{'Server': 'openresty', 'Date': 'Sun, 21 Jan 2018 01:08:11 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '18094', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Set-Cookie': '_xsrf=2|acc1cc58|fb495aec5628f018bc13a85be6a76a81|1516496891; Path=/', 'Vary': 'User-Agent, Accept-Encoding', 'Etag': '"560c073021ccc9e765bb6f4e4b4182594d4664ec"'}
<RequestsCookieJar[<Cookie _xsrf=2|acc1cc58|fb495aec5628f018bc13a85be6a76a81|1516496891 for www.qiushibaike.com/>]>
https://www.qiushibaike.com/
200
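A short sketch of working with these values: r.headers is a case-insensitive dictionary, so header names can be looked up in any letter case, and status codes can be compared against the named constants in requests.codes. The headers below are copied from the sample output rather than fetched, and is_success is a hypothetical helper, not part of Requests.

```python
import requests
from requests.structures import CaseInsensitiveDict

# Sample values copied from the output above; r.headers has this same type
headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=UTF-8",
    "Content-Length": "18094",
})

# Lookups ignore the letter case of the header name
content_type = headers["content-type"]
length = int(headers.get("CONTENT-LENGTH", 0))
print(content_type, length)


def is_success(status_code):
    """Hypothetical helper: True for any 2xx status code."""
    return 200 <= status_code < 300


print(is_success(requests.codes.ok))         # 200
print(is_success(requests.codes.not_found))  # 404
```

On a real response, r.raise_for_status() raises an exception for 4xx and 5xx codes, which is often simpler than checking r.status_code by hand.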
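Session objects in Requests
s = requests.Session()
requests.session() is an older alias that returns the same kind of object; both spellings work under Python 2 and Python 3.
All state for a session (such as cookies) is stored in s, so you only need to make requests through s.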
s.get(url)
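As a minimal sketch of what the session carries along, the example below sets a default header and a cookie on the session and then prepares a request through it without sending anything. The User-Agent string, the cookie name and value, and the httpbin.org URL are all made up for illustration.

```python
import requests

s = requests.Session()

# Headers set on the session are merged into every request made through it
s.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical UA string

# Cookies stored on the session are attached to requests for a matching domain
s.cookies.set("token", "abc123", domain="httpbin.org", path="/")  # made-up cookie

# Prepare a request through the session (not sent) to inspect the merged state
req = requests.Request("GET", "http://httpbin.org/get")
prepared = s.prepare_request(req)

print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

In normal use the cookie would come from a server response (for example after a login via s.post), and later calls to s.get would send it back automatically.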