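I have recently been working through a hands-on Scrapy crawler tutorial and ran into many problems with its introductory example, especially with CSS and XPath selectors in Scrapy: I could only get the basic features working and could not handle the real-world cases.
So I dropped the Scrapy framework and rebuilt the project on basics such as requests, only to find that it runs far slower than the Scrapy version!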
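One plausible explanation for the gap, which is an assumption and not something measured in this post: Scrapy sends requests concurrently by default, whereas a plain requests loop waits for each response before sending the next one. A minimal sketch of fetching pages on a thread pool is shown here; `fetch_all` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Offline stand-in for a real HTTP call, so the sketch runs anywhere.
    return "<html>" + url + "</html>"


pages = fetch_all(["page-1.html", "page-2.html", "page-3.html"], fake_fetch)
print(pages)
```

With requests, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. A shared `requests.Session()` would also reuse connections, though whether to share one session across threads is a judgment call.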
My code is below. It also contains a fair amount of redundancy, which adds unnecessary work and makes it run even slower.
import requests
from lxml import etree


class Books(object):
    def index(self, response):
        html = etree.HTML(response.text)  # parse into an element tree
        # print(html)
        # Collect the link to each book
        index_xpath = html.xpath('//article[@class="product_pod"]/h3/a/@href')
        # print(index_xpath)
        # Get the link to the next page
        next = html.xpath('//div/ul[@class="pager"]/li[@class="next"]/a/@href')
        # print(next)
        next_url = requests.get("http://books.toscrape.com/" + next[0])
        # print(next_url)
        html = etree.HTML(next_url.text)
        index_xpath.extend(html.xpath('//article[@class="product_pod"]/h3/a/@href'))
        # print(index_xpath)
        for i in index_xpath:
            # print(i)
            self.books(i)
        for i in range(2):  # 50 pages minus the home page and page 2, because the next-page URL on page 2 already carries catalogue
            # Get the link to the next page
            next = html.xpath('//div/ul[@class="pager"]/li[@class="next"]/a/@href')
            # print(next)
            next_url = requests.get("http://books.toscrape.com/catalogue/" + next[0])
            # print(next_url)
            html = etree.HTML(next_url.text)
            index_xpath.extend(html.xpath('//article[@class="product_pod"]/h3/a/@href'))
        print(index_xpath)
        for i in index_xpath:
            # print(i)
            self.books('catalogue/' + i)  # book URLs from page 3 onward differ from the earlier ones and need the catalogue/ prefix

    def books(self, index_xpath):
        response = requests.get('http://books.toscrape.com/' + index_xpath)
        # print(response)
        html = etree.HTML(response.text)
        name = html.xpath('//div[@class="col-sm-6 product_main"]/h1/text()')
        price = html.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()')
        # The star rating is hard to extract!
        npc = html.xpath('//table[@class="table table-striped"]/tr[1]/td/text()')
        # s = re.compile('\d+')
        # I could not get the stock count with XPath alone!
        # stock = html.xpath('//table[@class="table table-striped"]/tbody/tr[last()-1]/td/text()')
        num = html.xpath('//table[@class="table table-striped"]/tr[last()]/td/text()')
        for i, j, k, l in zip(name, price, npc, num):
            params = str((i, j, k, l))
            with open('books.csv', 'a', encoding='utf-8') as f:
                f.write(params + '\n')  # end each row with a newline


if __name__ == '__main__':
    response = requests.get('http://books.toscrape.com/')
    Books().index(response)
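The two fields marked as hard in the code above (star rating and stock), along with the manual `catalogue/` prefixing, can probably be handled more simply. The sketch below is a possible approach, not a verified fix. It runs against a hypothetical HTML sample modeled on a book detail page, so the real markup may differ, and `absolute_url`, `parse_rating` and `parse_stock` are illustrative names. One likely reason the commented-out stock query returned nothing is its `tbody` step: browsers insert `tbody` when rendering, but it is often absent from the raw HTML that lxml parses.

```python
import re
from urllib.parse import urljoin

from lxml import etree

# Hypothetical sample modeled on a book detail page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div class="col-sm-6 product_main">
  <h1>A Light in the Attic</h1>
  <p class="price_color">51.77</p>
  <p class="star-rating Three"></p>
</div>
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
</body></html>
"""

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def absolute_url(page_url, href):
    # Resolve a relative href against the page it was found on,
    # instead of deciding by hand when to prepend "catalogue/".
    return urljoin(page_url, href)


def parse_rating(html):
    # Assumes the rating is a second class name, e.g. "star-rating Three".
    classes = html.xpath('//p[contains(@class, "star-rating")]/@class')
    if not classes:
        return None
    for word in classes[0].split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def parse_stock(html):
    # Pick the cell by its row header rather than by position,
    # then extract the digits with a regular expression.
    cells = html.xpath(
        '//table[@class="table table-striped"]//tr[th="Availability"]/td/text()'
    )
    if not cells:
        return None
    match = re.search(r"\d+", cells[0])
    return int(match.group()) if match else None


html = etree.HTML(SAMPLE_HTML)
rating = parse_rating(html)
stock = parse_stock(html)
book_url = absolute_url(
    "http://books.toscrape.com/catalogue/page-2.html", "in-her-wake_980/index.html"
)
next_url = absolute_url("http://books.toscrape.com/", "catalogue/page-2.html")
print(rating, stock, book_url, next_url)
```

Because `urljoin` resolves each link relative to the page that contained it, the same call works for the home page and for pages under `catalogue/`, so each book URL only needs to be requested once.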
Original post: https://www.cnblogs.com/fodalaoyao/p/10434672.html