Advanced use of meta in Scrapy

Posted: 2019-01-14 20:17:25

Tags: findall current pre tail cat find extract gen detail

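The example below crawls books from a website. Each record has to carry its top-level category, its subcategory, and the book's own details.

The site has four kinds of pages: top-level category pages, subcategory pages, list pages, and detail pages.

A finished record must contain all of this information, so the data collected from the current response is passed on to the next parse function through meta.

Why is a deep copy needed? Scrapy schedules requests asynchronously, so a callback runs some time after the request was yielded, while the loop that yielded it keeps reassigning fields on the same item. Without a copy, a record that is not yet complete gets overwritten by the data of a later request.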
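The effect can be reproduced without Scrapy. The following is a minimal sketch, not the spider's code: `snapshot_shared` and `snapshot_copied` are hypothetical helper names, and a plain list stands in for the queue of pending requests. It only shows that a dict handed off by reference keeps changing after the hand-off.

```python
from copy import deepcopy


def snapshot_shared(categories):
    """Queue the same dict object on every iteration (no copy)."""
    item = {}
    queued = []
    for name in categories:
        item["s_cate"] = name
        queued.append({"item": item})  # every entry points at one dict
    return [meta["item"]["s_cate"] for meta in queued]


def snapshot_copied(categories):
    """Queue an independent deep copy on every iteration."""
    item = {}
    queued = []
    for name in categories:
        item["s_cate"] = name
        queued.append({"item": deepcopy(item)})  # frozen snapshot
    return [meta["item"]["s_cate"] for meta in queued]


cats = ["novels", "history", "science"]
print(snapshot_shared(cats))  # ['science', 'science', 'science']
print(snapshot_copied(cats))  # ['novels', 'history', 'science']
```

By the time the queued entries are read, the shared version only remembers the last category, which is what would happen to an item passed through meta without a copy.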

    # Module-level imports used by the spider methods below
    import re
    import scrapy
    from copy import deepcopy

    def parse(self, response):
        # 1. Group by top-level category
        li_list = response.xpath("//ul[@class='ulwrap']/li")
        for li in li_list:
            item = {}
            item["b_cate"] = li.xpath("./div[1]/a/text()").extract_first()
            # 2. Group by subcategory
            a_list = li.xpath("./div[2]/a")
            for a in a_list:
                item["s_href"] = a.xpath("./@href").extract_first()
                item["s_cate"] = a.xpath("./text()").extract_first()
                if item["s_href"] is not None:
                    item["s_href"] = "http://snbook.suning.com/" + item["s_href"]
                    yield scrapy.Request(
                        item["s_href"],
                        callback=self.parse_book_list,
                        # Copy: the loop keeps reassigning item after this yield
                        meta={"item": deepcopy(item)}
                    )

    def parse_book_list(self, response):
        # Work on a private copy so response.meta["item"] stays clean for paging
        item = deepcopy(response.meta["item"])
        # Group the entries on the book list page
        li_list = response.xpath("//div[@class='filtrate-books list-filtrate-books']/ul/li")
        for li in li_list:
            item["book_name"] = li.xpath(".//div[@class='book-title']/a/@title").extract_first()
            item["book_img"] = li.xpath(".//div[@class='book-img']//img/@src").extract_first()
            if item["book_img"] is None:
                # Fall back to the src2 attribute when src is absent
                item["book_img"] = li.xpath(".//div[@class='book-img']//img/@src2").extract_first()
            item["book_author"] = li.xpath(".//div[@class='book-author']/a/text()").extract_first()
            item["book_press"] = li.xpath(".//div[@class='book-publish']/a/text()").extract_first()
            item["book_desc"] = li.xpath(".//div[@class='book-descrip c6']/text()").extract_first()
            item["book_href"] = li.xpath(".//div[@class='book-title']/a/@href").extract_first()
            if item["book_href"] is None:
                # scrapy.Request rejects a None URL, so skip entries without a link
                continue
            yield scrapy.Request(
                item["book_href"],
                callback=self.parse_book_detail,
                # Pass the record on to the next parse function
                meta={"item": deepcopy(item)}
            )

        # Pagination: the page count and current page are embedded as JS variables
        page_count = int(re.findall(r"var pagecount=(.*?);", response.text)[0])
        current_page = int(re.findall(r"var currentPage=(.*?);", response.text)[0])
        if current_page < page_count:
            next_url = item["s_href"] + "?pageNumber={}&sort=0".format(current_page + 1)
            yield scrapy.Request(
                next_url,
                callback=self.parse_book_list,
                # The untouched item holds only the category fields
                meta={"item": response.meta["item"]}
            )



    def parse_book_detail(self, response):
        item = response.meta["item"]
        # The price is embedded in the page source as "bp":'...'
        prices = re.findall("\"bp\":'(.*?)',", response.text)
        item["book_price"] = prices[0] if prices else None
        # The record is complete here; yield it so item pipelines receive it
        yield item
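The paging step depends on two JavaScript variables embedded in the list page. The sketch below isolates that logic so it can be tried without a network request. `next_page_url` is a hypothetical helper, the sample HTML and the example.com URL are invented, the pattern uses `\d+` instead of the spider's `(.*?)`, and returning `None` when a variable is missing is an added assumption (indexing `[0]` on an empty `findall` result would raise `IndexError` instead).

```python
import re


def next_page_url(base_url, html):
    """Return the URL of the next list page, or None if there is none."""
    count = re.search(r"var pagecount=(\d+);", html)
    current = re.search(r"var currentPage=(\d+);", html)
    if count is None or current is None:
        # Assumption: treat a page without the variables as the last page
        return None
    page_count = int(count.group(1))
    current_page = int(current.group(1))
    if current_page < page_count:
        return "{}?pageNumber={}&sort=0".format(base_url, current_page + 1)
    return None


# Invented sample standing in for the list page source
sample = "<script>var pagecount=3; var currentPage=1;</script>"
print(next_page_url("http://example.com/books", sample))
# http://example.com/books?pageNumber=2&sort=0
```

Keeping the URL construction in a small pure function like this makes the stop condition easy to check in isolation before wiring it into a callback.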


Original post: https://www.cnblogs.com/tangkaishou/p/10268388.html
