Web Crawlers: Even Thieves Have a Code

Date: 2020-01-12 11:42:56


2.1 Problems Caused by Web Crawlers

[Figure: sizes of web crawlers]

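Restrictions on web crawlers

　　Source review: restrict access based on the User-Agent

　　The site checks the User-Agent field in the header of each incoming HTTP request and responds only to browsers or friendly crawlers.

　　Public notice: the Robots protocol

　　The site tells all crawlers its crawling policy and asks them to comply.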
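
The User-Agent check described above can be sketched on the server side roughly as follows. This is only an illustration, not any particular site's implementation: the function name `is_request_allowed` and the token lists `BROWSER_TOKENS` and `FRIENDLY_CRAWLERS` are assumptions made up for the example.

```python
# Hypothetical allow-lists; a real site would choose its own.
BROWSER_TOKENS = ("mozilla", "chrome", "safari", "firefox")
FRIENDLY_CRAWLERS = ("baiduspider", "googlebot")


def is_request_allowed(headers):
    """Return True if the User-Agent looks like a browser or a friendly crawler."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()
    if not user_agent:
        return False  # no User-Agent at all: refuse the request
    return any(token in user_agent for token in BROWSER_TOKENS + FRIENDLY_CRAWLERS)


print(is_request_allowed({"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"}))
print(is_request_allowed({"User-Agent": "python-requests/2.22.0"}))
```

Because the client sets the User-Agent itself, a crawler can send any value it likes, so this check only filters out crawlers that identify themselves honestly.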

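2.2 The Robots Protocol

Robots Exclusion Standard (the exclusion standard for web crawlers)

　　Purpose: a website tells web crawlers which pages may be crawled and which may not.

　　Form: a robots.txt file in the root directory of the website.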
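
Since the file always sits in the site's root directory, its address can be derived from any page URL on the same host. A minimal sketch, where the helper name `robots_url` is an assumption for this example:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.jd.com/pop/item.html?id=1"))
print(robots_url("http://news.sina.com.cn/china/"))
```

Each host name has its own file, which is why www.sina.com.cn and news.sina.com.cn each publish a separate robots.txt.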

Example: the Robots protocol of JD.com

https://www.jd.com/robots.txt

User-agent: * 
Disallow: /?* 
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider 
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /
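
The rules above can be checked programmatically. The sketch below feeds them to `urllib.robotparser` from the Python standard library; the rules are embedded as text so that no network access is needed, and the crawler name `MyCrawler` is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The JD.com rules quoted above, embedded as text.
JD_ROBOTS_TXT = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(JD_ROBOTS_TXT.splitlines())

# EtaoSpider is named explicitly and barred from the whole site.
etao_allowed = parser.can_fetch("EtaoSpider", "https://www.jd.com/")
# A crawler not named in the file falls under the "*" group.
other_allowed = parser.can_fetch("MyCrawler", "https://www.jd.com/index.html")

print(etao_allowed, other_allowed)
```

One caveat: the standard-library parser appears to match rules by path prefix and does not seem to expand the `*` wildcard inside paths such as /pop/*.html, so a crawler that wants to honor those patterns precisely would need to handle them itself.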

http://www.baidu.com/robots.txt

http://www.sina.com.cn/robots.txt

http://news.sina.com.cn/robots.txt

http://www.qq.com/robots.txt

http://news.qq.com/robots.txt

http://www.sdju.edu.cn/robots.txt (no Robots protocol)

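Using the Robots protocol

　　Web crawlers: read robots.txt, automatically or manually, before crawling any content.

　　Binding force: the Robots protocol is advisory rather than binding. A crawler may ignore it, but doing so carries legal risk.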
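
The "check first, then crawl" order can be sketched as a small gate in front of the download step. The function name `polite_fetch` and its parameters are assumptions for illustration; the download function is passed in so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def polite_fetch(url, user_agent, robots_lines, fetch):
    """Call fetch(url) only if robots_lines permit user_agent to access url.

    Returns the fetched content, or None when the rules disallow the URL.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(user_agent, url):
        return None  # respect the site's stated policy
    return fetch(url)


# Made-up rules and a stand-in download function for demonstration.
rules = ["User-agent: *", "Disallow: /private/"]
fake_fetch = lambda url: "content of " + url

print(polite_fetch("https://example.com/index.html", "MyCrawler", rules, fake_fetch))
print(polite_fetch("https://example.com/private/a.html", "MyCrawler", rules, fake_fetch))
```

In a real crawler the rules would come from the site itself, for example through `RobotFileParser.set_url()` followed by `read()`. With an empty rule list, as for a site that publishes no robots.txt, the gate lets every URL through.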


Original article: https://www.cnblogs.com/cripplepx/p/12181414.html
