books: XPath practice for a beginner

Date: 2019-02-26 00:47:44


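I have recently been working through a hands-on Scrapy crawler tutorial, and its introductory example gave me a lot of trouble. The CSS and XPath selectors were the worst part: I could get the basic features working, but not what the real task needed.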

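So I set the Scrapy framework aside and rebuilt the project with basic tools such as requests, only to find that it runs far slower than the Scrapy version!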
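A likely reason for much of the gap: Scrapy schedules many requests concurrently, whereas a plain requests loop waits for each page before asking for the next one. The sketch below is one way to narrow it with a standard-library thread pool. The names fetch_all, fetch and max_workers are made up for this illustration and are not part of requests or Scrapy.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool.

    `fetch` is any callable that takes a URL and returns the page.
    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls in parallel but yields results in input order
        return list(pool.map(fetch, urls))
```

With requests, the call might look like `fetch_all(book_urls, lambda url: requests.get(url, timeout=10).content)`, where book_urls is a hypothetical list of detail-page URLs collected from the listing pages. A small pool keeps the load on the site polite.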

My code follows. It also contains a fair amount of redundancy that repeats work and makes it slower still.

import csv
from urllib.parse import urljoin

import requests
from lxml import etree

BASE_URL = 'http://books.toscrape.com/'


class Books(object):
    def index(self, response):
        """Walk the listing pages via the "next" link and scrape every book."""
        while True:
            # Parse the raw bytes so lxml can honor the charset declared in the page
            html = etree.HTML(response.content)
            # Link to each book on this listing page
            index_xpath = html.xpath('//article[@class="product_pod"]/h3/a/@href')
            for href in index_xpath:
                # The hrefs are relative: the home page prefixes them with
                # "catalogue/" and later pages do not, so resolve each one
                # against the URL of the page it was found on
                self.books(urljoin(response.url, href))
            # Link to the next listing page; the last page has none
            next_href = html.xpath('//div/ul[@class="pager"]/li[@class="next"]/a/@href')
            if not next_href:
                break
            response = requests.get(urljoin(response.url, next_href[0]))

    def books(self, url):
        """Scrape one book detail page and append its fields to books.csv."""
        response = requests.get(url)
        html = etree.HTML(response.content)
        name = html.xpath('//div[@class="col-sm-6 product_main"]/h1/text()')
        price = html.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()')
        # The star rating is hard to get!
        # First row of the product table
        npc = html.xpath('//table[@class="table table-striped"]/tr[1]/td/text()')
        # s = re.compile(r'\d+')
        # I could not get the stock count with XPath alone!
        # stock = html.xpath('//table[@class="table table-striped"]/tbody/tr[last()-1]/td/text()')
        # Last row of the product table
        num = html.xpath('//table[@class="table table-striped"]/tr[last()]/td/text()')
        # newline='' lets the csv module control line endings itself
        with open('books.csv', 'a', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            for row in zip(name, price, npc, num):
                writer.writerow(row)  # one book per line


if __name__ == '__main__':
    response = requests.get(BASE_URL)
    Books().index(response)
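Links on books.toscrape.com are relative to the page they appear on, which is why the same kind of href needs a catalogue/ prefix when it comes from one page and not from another. urllib.parse.urljoin from the standard library resolves a relative href against the URL of the page it was found on, so no special cases are needed. A small sketch, where the sample hrefs are illustrative values in the shape the site uses:

```python
from urllib.parse import urljoin

HOME = 'http://books.toscrape.com/'
PAGE_2 = 'http://books.toscrape.com/catalogue/page-2.html'

# A book link as found on the home page (it already carries "catalogue/")
book_from_home = urljoin(HOME, 'catalogue/a-light-in-the-attic_1000/index.html')

# A book link as found on a later listing page (no prefix)
book_from_page_2 = urljoin(PAGE_2, 'in-her-wake_980/index.html')

# The "next" link as found on page 2
next_from_page_2 = urljoin(PAGE_2, 'page-3.html')

print(book_from_home)
print(book_from_page_2)
print(next_from_page_2)
```

All three results land under http://books.toscrape.com/catalogue/, because urljoin replaces the last path segment of the page URL with the relative href.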
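On the two fields the code leaves unfinished, the star rating and the stock count, here is one possible approach. It assumes the detail page marks the rating as a class such as "star-rating Three" on a p element, and shows stock as text like "In stock (22 available)" in the table row labelled Availability. The commented-out stock XPath probably returned nothing because lxml, unlike a browser's developer tools, does not add a tbody element that is missing from the page source. The constants and helper names below are hypothetical.

```python
import re

# XPath expressions for the assumed markup; "//tr" matches with or without tbody
RATING_XPATH = '//p[contains(@class, "star-rating")]/@class'
STOCK_XPATH = '//table[@class="table table-striped"]//tr[th="Availability"]/td/text()'

RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def parse_rating(class_attr):
    """Turn a class attribute such as 'star-rating Three' into 3."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None  # no rating word found


def parse_stock(text):
    """Pull the first number out of text such as 'In stock (22 available)'."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None
```

Inside books(), these could be used as `parse_rating(html.xpath(RATING_XPATH)[0])` and `parse_stock(html.xpath(STOCK_XPATH)[0])`, after checking that each XPath result is not empty.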


Original article: https://www.cnblogs.com/fodalaoyao/p/10434672.html
