crawl spider

Date: 2018-06-18 18:28:33

CrawlSpider

Usage
scrapy genspider -t crawl <spider_name> <domain>

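What is CrawlSpider?
It is also a spider: CrawlSpider is a subclass of Spider, so it can do everything Spider does and more.
The extra feature is link extraction: it pulls the links you specify out of each response, according to rules you define.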

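Link extractor
LinkExtractor(
allow=xxx, # regular expression(s); only URLs that match are extracted (*)
deny=xxx, # regular expression(s); URLs that match are excluded
restrict_xpaths=xxx, # XPath; only links inside the matched regions are extracted (*)
restrict_css=xxx, # CSS selector; only links inside the matched regions are extracted (*)
deny_domains=xxx, # domains whose links are never extracted
)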

Extracting links with a regular expression
links = LinkExtractor(allow=r'/movie/\?page=\d')
This collects every href whose URL matches the regular expression.
Call links.extract_links(response) to see the extracted links.
Note: duplicate URLs are removed.
Extracting with XPath
links = LinkExtractor(restrict_xpaths='//ul[@class="pagination pagination-sm"]/li/a')
Extracting with CSS
links = LinkExtractor(restrict_css='.pagination > li > a')

Original article: https://www.cnblogs.com/airapple/p/9195467.html