Winter Break Study Report 10

Date: 2020-02-10 22:56:37


Today I continued working on the web crawler.

I ran into several problems and only solved them after a lot of searching.

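The values returned by response.xpath(...).extract() contained \r, \n and \t characters.
How can they be removed? Wrap the expression in XPath's normalize-space().
For example:
response.xpath('//div[@class=""]/text()').extract()
With normalize-space():
response.xpath('normalize-space(//div[@class=""]/text())').extract()
Note that normalize-space() converts its argument to a string first, so when the path matches several text nodes only the first one is returned.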
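To make the effect of normalize-space() concrete, here is a minimal pure-Python sketch of what the function does to a single string: it strips leading and trailing whitespace and collapses every internal run of whitespace to one space. The helper name normalize_space and the sample string are made up for illustration and are not part of Scrapy. Python's str.split() also treats a few more characters as whitespace than XPath does, so this is an approximation.

```python
def normalize_space(text):
    """Rough Python counterpart of XPath's normalize-space() (hypothetical helper)."""
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading/trailing whitespace, so joining the pieces with a
    # single space collapses every run to exactly one space.
    return " ".join(text.split())


raw = "\r\n\t  Scrapy   study\r\n\treport \t"
cleaned = normalize_space(raw)
print(repr(cleaned))
```

Unlike deleting every space, this keeps one space between words, which is usually what is wanted for readable text.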

Outside the XPath expression, the same characters can also be removed in Python with str.replace:
name = name.replace('\r', '').replace('\n', '').replace('\t', '').replace(' ', '')
This removes every \r, \n, \t and space from the string.
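The chain of replace() calls can also be folded into a single pass. The following is only a sketch using the standard re module; the helper name strip_whitespace and the sample value are assumptions for illustration. Because it deletes spaces as well, words separated by a space end up joined together, exactly as with the replace() chain.

```python
import re

# Matches one or more of: carriage return, newline, tab, space.
_WHITESPACE = re.compile(r"[\r\n\t ]+")


def strip_whitespace(name):
    """Delete every \\r, \\n, \\t and space from name (hypothetical helper)."""
    return _WHITESPACE.sub("", name)


raw = "\r\n\tbai ma\tfei ma \r\n"
print(strip_whitespace(raw))
```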

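Scraping a <div> that contains <p> tags with Scrapy
I wanted to scrape all of the content inside a div, but the div contains p tags.
Calling response.xpath('//div[@class=""]/text()').extract() directly does not return the text inside the <p> elements within the <div>, because text() only selects text nodes that are direct children of the div.
What is needed is response.xpath('//div[@class=""]/descendant::text()').extract(), which selects text nodes at any depth.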
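The difference between direct text children and descendant text can be seen without a crawler. The sketch below uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, so it is an analogy and not the Scrapy API; the snippet, the class name "content" and both helper names are invented for illustration, and the input has to be well-formed XML.

```python
import xml.etree.ElementTree as ET

snippet = '<div class="content">intro <p>inside p</p> outro</div>'
div = ET.fromstring(snippet)


def direct_text(element):
    """Text nodes that are direct children, similar to div/text()."""
    # element.text is the text before the first child; each child's
    # tail is the text that follows that child inside the parent.
    parts = [element.text] + [child.tail for child in element]
    return [part for part in parts if part]


def descendant_text(element):
    """Text nodes at any depth, similar to div/descendant::text()."""
    return list(element.itertext())


print(direct_text(div))
print(descendant_text(div))
```

The first list lacks the text inside the p element, while the second includes it in document order.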

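Passing arguments from parse to another callback in Scrapy

import scrapy

# Methods of a scrapy.Spider subclass; url is assumed to be defined elsewhere.
def parse(self, response):
    yield scrapy.Request(url, callback=self.next, meta={'rname': '2'})

def next(self, response):
    print(response.meta['rname'])  # the value set in parse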

After that I optimized the previous program a bit more.



Original article: https://www.cnblogs.com/baimafeima/p/12292978.html
