Python 爬虫-Robots协议

时间：2017-07-25 22:39:15 阅读：523 评论：0 收藏：0 [点我收藏+]

标签：原则 nbsp width span robot img 网络 tps htm

2017-07-25 21:08:16

一、网络爬虫的规模

二、网络爬虫的限制

? 来源审查：判断User‐Agent进行限制
　　检查来访HTTP协议头的User‐Agent域，只响应浏览器或友好爬虫的访问
? 发布公告：Robots协议
　　告知所有爬虫网站的爬取策略，要求爬虫遵守

三、Robots 协议

作用：网站告知网络爬虫哪些页面可以抓取，哪些不行
形式：在网站根目录下的robots.txt文件

如果网站不提供Robots协议则表示该网站允许任意爬虫爬取任意次数。

类人类行为原则上可以不遵守Robots协议

https://www.baidu.com/robots.txt
http://news.sina.com.cn/robots.txt

举例：

https://www.jd.com/robots.txt

User‐agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User‐agent: EtaoSpider
Disallow: /
User‐agent: HuihuiSpider
Disallow: /
User‐agent: GwdangSpider
Disallow: /
User‐agent: WochachaSpider
Disallow: /

# 注释，*代表所有，/代表根目录
User‐agent: *
Disallow: /

Python 爬虫-Robots协议

标签：原则 nbsp width span robot img 网络 tps htm

原文地址：http://www.cnblogs.com/TIMHY/p/7236670.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行