How to get the next page link in Scrapy

Date: 2018-01-22 19:13:57

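Scrapy starts crawling from the start URLs, then keeps fetching more data by following the link to the next page.

So how do you get the next page link? There are two common approaches:

1. Take it from the "next page" link on the current page, for example: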

<div class=zw_page1>
下一篇：<a href="../../JokeHtml/bxnn/2017122722221351.htm">爆逗二货,醉人的笑容你会有</a>
</div>

A link obtained this way is usually a relative URL and has to be converted to an absolute URL, as follows:

# Get the link to the next article
nexthref = response.xpath('//div[@class="zw_page1"]/a/@href').extract_first()
if nexthref is not None:
    # Convert the relative URL to an absolute URL
    nexthref = response.urljoin(nexthref)
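
To see what the conversion does to the href from the HTML above, here is a minimal sketch using only the standard library. It assumes that response.urljoin behaves roughly like urllib.parse.urljoin applied to the URL of the current response. PAGE_URL and to_absolute are hypothetical names for illustration; the original post does not give the URL of the page being parsed.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page being parsed (not given in the original post).
PAGE_URL = "http://www.haha365.com/JokeHtml/bxnn/current.htm"


def to_absolute(page_url: str, href: str) -> str:
    """Resolve href against page_url using the standard relative-URL rules."""
    return urljoin(page_url, href)


# The relative href taken from the sample HTML above.
next_url = to_absolute(PAGE_URL, "../../JokeHtml/bxnn/2017122722221351.htm")
print(next_url)
```

In a spider callback, the resolved URL would typically be handed back to Scrapy with something like yield scrapy.Request(nexthref, callback=self.parse), so that the next page goes through the same parsing method.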

2. The URLs to crawl follow a regular pattern, for example:

http://www.haha365.com/joke/index_1.htm

http://www.haha365.com/joke/index_2.htm

......

http://www.haha365.com/joke/index_1022.htm

In this case the next page URL can be generated from the current one, as follows:

import re

# Build the next page URL from the page number in the current URL
nexthref = None
match = re.search(r'index_([0-9]+)', response.url)
if match is not None:
    i = int(match.group(1)) + 1
    nexthref = "http://www.haha365.com/joke/index_" + str(i) + ".htm"
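
The snippet above keeps incrementing the page number forever, so a crawl built on it needs a stopping condition. The following is a minimal sketch of one way to add it. The helper name next_page_url and the last_page parameter are assumptions for illustration; the default of 1022 is simply the last page number in the URL list above, which may no longer match the live site.

```python
import re
from typing import Optional

# Matches the page number in URLs such as .../joke/index_12.htm
_PAGE_RE = re.compile(r"index_([0-9]+)\.htm$")


def next_page_url(current_url: str, last_page: int = 1022) -> Optional[str]:
    """Return the URL of the following page, or None when there is none."""
    match = _PAGE_RE.search(current_url)
    if match is None:
        # The URL does not follow the index_N.htm pattern.
        return None
    next_number = int(match.group(1)) + 1
    if next_number > last_page:
        # Past the assumed last page: stop paginating.
        return None
    # Swap only the page number, keeping the rest of the URL unchanged.
    return current_url[:match.start(1)] + str(next_number) + current_url[match.end(1):]


print(next_page_url("http://www.haha365.com/joke/index_1.htm"))
```

Returning None instead of raising lets the calling callback simply skip the request when pagination is finished or the URL does not fit the pattern.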


Original source: https://www.cnblogs.com/sam11/p/8329976.html
