标签:ict 股票 ret upd dict 修改 except continue sspi
周末了解了scrapy框架,对上次使用requests+bs4+re进行股票爬虫(http://www.cnblogs.com/wyfighting/p/7497985.html)的代码,使用scrapy进行了重写。
目录结构:
stocks.py文件代码
1 # -*- coding: utf-8 -*- 2 import scrapy 3 import re 4 5 6 class StocksSpider(scrapy.Spider): 7 name = "stocks" 8 start_urls = [‘http://quote.eastmoney.com/stocklist.html‘] 9 10 def parse(self, response): 11 for href in response.css(‘a::attr(href)‘).extract(): 12 try: 13 stock = re.findall(r"[s][hz]\d{6}", href)[0] 14 url = ‘https://gupiao.baidu.com/stock/‘ + stock + ‘.html‘ 15 yield scrapy.Request(url, callback=self.parse_stock) 16 except: 17 continue 18 19 def parse_stock(self, response): 20 infoDict = {} 21 stockInfo = response.css(‘.stock-bets‘) 22 name = stockInfo.css(‘.bets-name‘).extract()[0] 23 keyList = stockInfo.css(‘dt‘).extract() 24 valueList = stockInfo.css(‘dd‘).extract() 25 for i in range(len(keyList)): 26 key = re.findall(r‘>.*</dt>‘, keyList[i])[0][1:-5] 27 try: 28 val = re.findall(r‘\d+\.?.*</dd>‘, valueList[i])[0][0:-5] 29 except: 30 val = ‘--‘ 31 infoDict[key]=val 32 33 infoDict.update( 34 {‘股票名称‘: re.findall(‘\s.*\(‘,name)[0].split()[0] + 35 re.findall(‘\>.*\<‘, name)[0][1:-1]}) 36 yield infoDict
pipelines.py文件代码:
1 # -*- coding: utf-8 -*- 2 3 # Define your item pipelines here 4 # 5 # Don‘t forget to add your pipeline to the ITEM_PIPELINES setting 6 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 8 9 class BaidustocksPipeline(object): 10 def process_item(self, item, spider): 11 return item 12 13 class BaidustocksInfoPipeline(object): 14 def open_spider(self, spider): 15 self.f = open(‘BaiduStockInfo.txt‘, ‘w‘) 16 17 def close_spider(self, spider): 18 self.f.close() 19 20 def process_item(self, item, spider): 21 try: 22 line = str(dict(item)) + ‘\n‘ 23 self.f.write(line) 24 except: 25 pass 26 return item
settings.py文件中被修改的区域:
1 # Configure item pipelines 2 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 3 ITEM_PIPELINES = { 4 ‘BaiduStocks.pipelines.BaidustocksInfoPipeline‘: 300, 5 }
标签:ict 股票 ret upd dict 修改 except continue sspi
原文地址:http://www.cnblogs.com/wyfighting/p/7541415.html