```python
# -*- coding: utf-8 -*-
import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Search all descendants for the div with id="dig_lcpage",
        # find every <a> tag inside it,
        # and take each one's href attribute.
        # extract() converts the selectors into plain strings.
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res:
            print(url)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/2
        '''
        # Notice the duplicate: we start on page one and the pager shows ten
        # pages at a time, so the "next page" link also points to page two,
        # which is why that href appears twice.
        # A set lets us skip URLs we have already seen:
        urls = set()
        for url in res:
            if url in urls:
                print(f"{url}--this url already exists")
            else:
                urls.add(url)
                print(url)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/2--this url already exists
        '''
```
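To experiment with the XPath without running a full crawl, Scrapy's standalone Selector can be fed raw HTML directly. A minimal sketch; the pager markup below is a made-up stand-in for chouti's real page, not its actual HTML:

```python
from scrapy import Selector

# Hypothetical pager markup standing in for dig.chouti.com's real page.
html = '''
<div id="dig_lcpage">
  <a href="/all/hot/recent/2">2</a>
  <a href="/all/hot/recent/3">3</a>
  <a href="/all/hot/recent/2">Next</a>
</div>
'''

sel = Selector(text=html)
# Same expression as in the spider: every href under the pager div.
print(sel.xpath('//div[@id="dig_lcpage"]//a/@href').extract())
# ['/all/hot/recent/2', '/all/hot/recent/3', '/all/hot/recent/2']
```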
```python
# -*- coding: utf-8 -*-
import hashlib

import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Above we compared the raw URLs, but in practice we usually don't.
        # The seen URLs may live in a cache or a database, and long URLs
        # waste space there, so we hash each one and compare digests instead.
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # We could also select exactly the <a> tags we want:
        '''
        Find the <a> tags whose href starts with "/all/hot/recent/":
        res = response.xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract()

        Or match them with a regular expression; re:test is the fixed syntax for this:
        res = response.xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        '''
        md5_urls = set()
        for url in res:
            md5_url = self.md5(url)
            if md5_url in md5_urls:
                print(f"{url}--this url already exists")
            else:
                md5_urls.add(md5_url)
                print(url)

    def md5(self, url):
        m = hashlib.md5()
        m.update(bytes(url, encoding="utf-8"))
        return m.hexdigest()
```
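The two alternative selectors from the comment above can be tried the same way; re:test comes from the EXSLT regular-expression extension that Scrapy's selectors register by default. A short sketch on the same made-up markup as before:

```python
from scrapy import Selector

html = '''
<div id="dig_lcpage">
  <a href="/all/hot/recent/2">2</a>
  <a href="/all/hot/recent/3">3</a>
  <a href="/link/somewhere">other</a>
</div>
'''

sel = Selector(text=html)

# Prefix match: hrefs that start with the pager path.
print(sel.xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract())

# Regex match via the EXSLT re:test extension.
print(sel.xpath(r'//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract())

# Both print: ['/all/hot/recent/2', '/all/hot/recent/3']
```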
```python
# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # During a recursive crawl, parse runs over and over, so md5_urls must
    # be a class attribute rather than a local variable inside parse.
    md5_urls = set()

    def parse(self, response):
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res:
            md5_url = self.md5(url)
            if md5_url in self.md5_urls:
                pass
            else:
                print(url)
                self.md5_urls.add(md5_url)
                # Hand each newly seen URL back to the scheduler.
                url = "https://dig.chouti.com%s" % url
                yield Request(url=url, callback=self.parse)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/1
        /all/hot/recent/11
        /all/hot/recent/12
        ........
        ........
        ........
        /all/hot/recent/115
        /all/hot/recent/116
        /all/hot/recent/117
        /all/hot/recent/118
        /all/hot/recent/119
        /all/hot/recent/120
        '''

    def md5(self, url):
        m = hashlib.md5()
        m.update(bytes(url, encoding="utf-8"))
        return m.hexdigest()
```
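A side note on building the absolute URL: instead of string formatting against a hard-coded host, response.urljoin(url) resolves the relative href against the page's own URL. It wraps the standard library's urljoin, which is easy to check on its own:

```python
from urllib.parse import urljoin

# response.urljoin("/all/hot/recent/2") does the equivalent of:
print(urljoin("https://dig.chouti.com/", "/all/hot/recent/2"))
# https://dig.chouti.com/all/hot/recent/2
```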
As you can see, the spider dug up every page number on the site. I don't want it to walk through all of them, so we can cap the crawl depth.
Add DEPTH_LIMIT = 2 in settings.py: the spider then only crawls two levels deep, i.e., after the current ten pages are done it follows links for two more levels.
If DEPTH_LIMIT < 0, only one level is crawled; if it equals 0, there is no limit and everything is crawled; if it is greater than 0, the crawl goes down to the specified depth.
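Besides settings.py, the limit can also be scoped to a single spider through Scrapy's custom_settings class attribute; a minimal sketch:

```python
import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # Same effect as DEPTH_LIMIT = 2 in settings.py,
    # but applied only to this spider.
    custom_settings = {"DEPTH_LIMIT": 2}
```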
```python
# -*- coding: utf-8 -*-
# Same spider as above, now run with DEPTH_LIMIT = 2 in settings.py.
import hashlib

import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # During a recursive crawl, parse runs over and over, so md5_urls must
    # be a class attribute rather than a local variable inside parse.
    md5_urls = set()

    def parse(self, response):
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res:
            md5_url = self.md5(url)
            if md5_url in self.md5_urls:
                pass
            else:
                print(url)
                self.md5_urls.add(md5_url)
                # Hand each newly seen URL back to the scheduler.
                url = "https://dig.chouti.com%s" % url
                yield Request(url=url, callback=self.parse)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/1
        /all/hot/recent/11
        /all/hot/recent/12
        /all/hot/recent/13
        /all/hot/recent/14
        /all/hot/recent/15
        /all/hot/recent/16
        /all/hot/recent/17
        /all/hot/recent/18
        '''

    def md5(self, url):
        m = hashlib.md5()
        m.update(bytes(url, encoding="utf-8"))
        return m.hexdigest()
```
So after the current ten pages are crawled, going one level deeper reaches fourteen pages, and one more level reaches eighteen; each extra level evidently extends the pager's frontier by four pages, which matches the output above.
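To watch the depth bookkeeping yourself, Scrapy's DepthMiddleware records each response's level in response.meta['depth'] (0 for the start page). A small sketch using a hypothetical DepthLogSpider, not from the original post:

```python
import scrapy


class DepthLogSpider(scrapy.Spider):
    name = 'depth_log'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # DepthMiddleware stamps every request with its level: the start
        # page reports depth 0, pages reached from it depth 1, and so on.
        print(f"depth={response.meta.get('depth', 0)}  url={response.url}")
        for href in response.xpath('//div[@id="dig_lcpage"]//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
```

Note that even without the hand-rolled md5 set, Scrapy's default dupefilter (RFPDupeFilter) silently drops requests whose fingerprint it has already seen, so a sketch like this does not loop forever.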
Original post: https://www.cnblogs.com/traditional/p/9256410.html