Tags: interest, path, tps, middle, class, extract, availability, free, div
Background: while scraping a certain site, every test I ran looked fine, and the extraction code itself worked.
    # Body of the spider's parse callback; needs `import re` and the author's `zidian` module.
    IP = response.xpath('//*[@class="mimvp-tbl free-proxylist-tbl"]/tbody/tr/td[2]//text()').extract()
    port = response.xpath('//*[@class="mimvp-tbl free-proxylist-tbl"]/tbody/tr/td[3]/img//@src').extract()
    for i in range(len(IP)):
        # PROXY = "http://" + IP[i] + ":" + port[i]
        # Shows how the proxy URL is assembled; if interested, check IP availability right here.
        try:
            a = port[i]
            # link = 'https://proxy.mimvp.com'
            # links = link + a + '.png'
            # print(links)
            b = re.findall(r"W12.*", a)  # data cleaning
            b = ''.join(b)
            # self.port_sheng(b, links)
        except:
            print('failed')
        try:
            t = zidian.zi_dian()
            port = str(t[b])
            print('proxy')
            c = IP[i]
            if '*' in c:
                for i in range(256):
                    # IP_1 = c.replace('***', str(i))
                    IP_1 = re.sub('[*]+', str(i), c)
                    yield {
                        'imgname': '',
                        'imgurl': '',
                        'IP': IP_1,
                        'port': port,
                    }
            else:
                yield {
                    'imgname': '',
                    'imgurl': '',
                    'IP': c,
                    'port': port,
                }
        except:
            port = '9999'
            print('proxy')
            c = IP[i]
            if '*' in c:
                for i in range(256):
                    # IP_1 = c.replace('***', str(i))
                    IP_1 = re.sub('[*]+', str(i), c)
                    yield {
                        'imgname': '',
                        'imgurl': '',
                        'IP': IP_1,
                        'port': port,
                    }
            else:
                yield {
                    'imgname': '',
                    'imgurl': '',
                    'IP': c,
                    'port': port,
                }
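No matter what I tried, the IP field was always scraped, and the port field on its own was fine. But as soon as the port field was cleaned inside this code, port raised IndexError: string index out of range and nothing was extracted.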
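One possible explanation, judging only from the listing as printed in this post and not confirmed by the author: inside the loop, `port = str(t[b])` (or `port = '9999'`) rebinds the name `port` from the list of extracted `src` values to a short string, so a later `port[i]` indexes that string instead of the list. The standalone sketch below reproduces that effect without Scrapy. `RAW_PORTS` and `LOOKUP` are made-up stand-ins for the XPath result and for `zidian.zi_dian()`.

```python
# Hypothetical data: stand-ins for the extracted src values and the author's lookup table.
RAW_PORTS = ["/img/a.png", "/img/b.png", "/img/c.png", "/img/d.png", "/img/e.png"]
LOOKUP = {"/img/a.png": 80}


def rebinding_demo():
    """Reuse one name for both the list and the cleaned value, as in the listing."""
    port = list(RAW_PORTS)
    seen = []
    for i in range(len(RAW_PORTS)):
        raw = port[i]  # from the second pass on, `port` is a str, not the list
        seen.append(raw)
        port = str(LOOKUP.get(raw, 9999))  # rebinding: `port` now names a short string
    return seen


def separate_names_demo():
    """Same loop, but the cleaned value gets its own name."""
    ports = list(RAW_PORTS)
    cleaned = []
    for i in range(len(ports)):
        raw = ports[i]  # `ports` stays a list for the whole loop
        cleaned.append(str(LOOKUP.get(raw, 9999)))
    return cleaned
```

Calling `rebinding_demo()` raises `IndexError: string index out of range` once `i` passes the length of the rebound string, while `separate_names_demo()` runs to the end. Whether this is what happened on the real site is an assumption.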
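After four days of trying, I had a sudden idea: skip the cleaning altogether and just pass the values through intermediate variables: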
    IP = response.xpath('//*[@class="mimvp-tbl free-proxylist-tbl"]/tbody/tr/td[2]//text()').extract()
    port = response.xpath('//*[@class="mimvp-tbl free-proxylist-tbl"]/tbody/tr/td[3]/img//@src').extract()
    for i in range(len(IP)):
        a = IP[i]
        b = port[i]
        yield {
            'imgname': '',
            'imgurl': '',
            'IP': a,
            'port': b,
        }
No luck: it still raised the error, and the port field was simply not extracted.
Finally I tried this:
    IP = response.xpath('//*[@class="mimvp-tbl free-proxylist-tbl"]/tbody/tr/td[2]//text()').extract()
    port = response.xpath('//*[@class="mimvp-tbl free-proxylist-tbl"]/tbody/tr/td[3]/img//@src').extract()
    for i in range(len(IP)):
        yield {
            'imgname': '',
            'imgurl': '',
            'IP': IP[i],
            'port': port[i],
        }
That worked perfectly: port was captured. The data can then be cleaned afterwards, wherever the scraped items end up.
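As a small variation on yielding the raw values directly, the two lists can be paired with `zip`, which removes the index variable entirely. This is only a sketch: `raw_items` is a hypothetical helper name, and the lists passed in stand for the two `.extract()` results.

```python
def raw_items(ips, ports):
    """Yield one uncleaned item per (IP, port src) pair."""
    # zip stops at the shorter list, so a missing port cannot cause an index error;
    # the unmatched trailing entries are silently dropped.
    for ip, port_src in zip(ips, ports):
        yield {
            'imgname': '',
            'imgurl': '',
            'IP': ip,
            'port': port_src,
        }
```

Inside a spider callback this could be used as `yield from raw_items(IP, port)`.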
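I had never run into this before. The project covers 30 sites in total, and only this one misbehaves; I don't know why it raises the error. From now on, when I see something like this, I will extract the raw data first and clean it afterwards.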
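A minimal sketch of what "clean it afterwards" could look like, reusing the regular expressions from the first listing. `PORT_TABLE` is an invented stand-in for the author's `zidian.zi_dian()` table (its real keys and values are not shown in the post), and `clean_port`, `expand_ip` and `clean_item` are hypothetical helper names.

```python
import re

# Invented lookup table; the real mapping lives in the author's zidian module.
PORT_TABLE = {"W12abc": 8080}
DEFAULT_PORT = "9999"  # fallback value used in the original listing


def clean_port(src):
    """Map a raw image src to a port string, falling back to DEFAULT_PORT."""
    match = re.search(r"W12.*", src)
    key = match.group(0) if match else ""
    return str(PORT_TABLE.get(key, DEFAULT_PORT))


def expand_ip(ip):
    """Yield candidate IPs, replacing each masked run of '*' with 0..255."""
    if "*" not in ip:
        yield ip
        return
    for n in range(256):
        yield re.sub(r"[*]+", str(n), ip)


def clean_item(item):
    """Turn one raw scraped item into a list of cleaned items."""
    port = clean_port(item["port"])
    return [
        {"imgname": "", "imgurl": "", "IP": ip, "port": port}
        for ip in expand_ip(item["IP"])
    ]
```

Such a function could be called, for example, from an item pipeline or from a separate script that reads the exported items. Each helper uses its own local names, so the raw `port` value is never overwritten while it is still needed.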
An important lesson for scraping with Scrapy!! This problem troubled me for four days, and I nearly lost all my hair over it.
Original article: https://www.cnblogs.com/aotumandaren/p/14127059.html