If you do web scraping, Scrapy is a framework you can't avoid. For small projects the requests module alone is enough to get results, but once the amount of data to crawl grows large, a framework becomes essential.
As a warm-up, let's use Scrapy to write a crawler for the Maoyan movie board. Environment setup and Scrapy installation are omitted.
Step one is to create the crawler project and spider file from the terminal:
```python
# create the crawler project
scrapy startproject Maoyan
cd Maoyan
# create the spider file
scrapy genspider maoyan maoyan.com
```
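After these two commands, `startproject` and `genspider` leave you with Scrapy's standard project layout (shown here for orientation; this is the stock structure Scrapy generates, the files referenced in the rest of this post are marked):

```
Maoyan/
├── scrapy.cfg
└── Maoyan/
    ├── __init__.py
    ├── items.py        # data structure, edited next
    ├── middlewares.py
    ├── pipelines.py    # persistence, edited below
    ├── settings.py     # configuration, edited below
    └── spiders/
        ├── __init__.py
        └── maoyan.py   # the spider itself
```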
Then define the data structure you want to scrape in the generated items.py file:
```python
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
```
Next, open maoyan.py and write the spider. Remember to import the MaoyanItem class from items.py and instantiate it:
```python
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'   # must match the name passed to `scrapy crawl`
    allowed_domains = ['maoyan.com']
    # the start_urls variable is removed

    # override the start_requests() method instead
    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # base xpath
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # iterate over each entry
        for dd in dd_list:
            # instantiate the MaoyanItem(scrapy.Item) class from items.py
            item = MaoyanItem()
            # assign to the fields defined in items.py
            item['name'] = dd.xpath('./a/@title').get().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()

            # hand the item object over to the pipelines
            yield item
```
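The `start_requests()` loop above yields one request per page: the board shows 10 films per page and pages via the `offset` query parameter, so `range(0, 91, 10)` covers offsets 0 through 90. A quick standalone check of the URLs it generates (no Scrapy needed):

```python
# one offset per page of the Top 100 board: 0, 10, 20, ..., 90
offsets = list(range(0, 91, 10))
urls = ['https://maoyan.com/board/4?offset={}'.format(o) for o in offsets]

print(len(urls))   # 10
print(urls[0])     # https://maoyan.com/board/4?offset=0
print(urls[-1])    # https://maoyan.com/board/4?offset=90
```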
Then define the pipeline file pipelines.py for persistent storage:
```python
import pymysql
from .settings import *


class MaoyanPipeline(object):
    # item: the item data yielded from the spider file maoyan.py
    def process_item(self, item, spider):
        print(item['name'], item['time'], item['star'])
        return item


# custom pipeline - MySQL database
class MaoyanMysqlPipeline(object):
    # executed once when the spider starts
    def open_spider(self, spider):
        print('open_spider function called')
        # typically used to establish the database connection
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHAR
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # the filmtab table must already exist in the database
        ins = 'insert into filmtab values(%s,%s,%s)'
        # execute() takes the SQL parameters as a sequence
        L = [item['name'], item['star'], item['time']]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    # executed once when the spider finishes
    def close_spider(self, spider):
        print('close_spider function called')
        # typically used to close the database connection
        self.cursor.close()
        self.db.close()
```
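Note that the MySQL pipeline assumes the `filmtab` table already exists (created beforehand with something like `create table filmtab(name varchar(100), star varchar(300), time varchar(100));` — the exact column types are an assumption, the original post does not show the DDL). The parameterized-insert logic of `process_item()` can be sketched standalone with the stdlib sqlite3 module; sqlite uses `?` placeholders where pymysql uses `%s`, but the shape is the same:

```python
import sqlite3

# in-memory database standing in for MySQL in this sketch
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('create table filmtab(name text, star text, time text)')

# sample item, same keys as MaoyanItem
item = {'name': '霸王别姬', 'star': '主演:张国荣', 'time': '上映时间:1993-01-01'}

# parameterized insert, mirroring process_item()
ins = 'insert into filmtab values(?,?,?)'   # pymysql would use %s here
cursor.execute(ins, [item['name'], item['star'], item['time']])
db.commit()

cursor.execute('select name from filmtab')
print(cursor.fetchone()[0])   # 霸王别姬
```

Passing the values as a sequence instead of formatting them into the SQL string lets the driver do the escaping, which matters for movie titles containing quotes.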
Next, modify the configuration file settings.py:
```python
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
    'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}
# MySQL-related variables
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHAR = 'utf8'
```
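The integer in `ITEM_PIPELINES` is a priority: Scrapy runs pipelines in ascending order, so with the values above each item passes through `MaoyanMysqlPipeline` (200) before `MaoyanPipeline` (300). A quick sketch of that ordering:

```python
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
    'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}

# Scrapy sorts enabled pipelines by their priority value, ascending
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)   # MaoyanMysqlPipeline first, then MaoyanPipeline
```

Because each pipeline's `process_item()` must `return item` for the next pipeline in line to receive it, both classes above end with `return item`.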
Finally, create a run.py file, and the crawler is ready to run:
```python
from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
```
Original article: https://www.cnblogs.com/lattesea/p/11756552.html