
Scraping the Maoyan Movie Leaderboard with Scrapy


Anyone who writes crawlers ends up using the Scrapy framework sooner or later. For small jobs the requests module alone will get you results, but once the volume of data to crawl grows, a framework becomes essential.

As a warm-up, let's use Scrapy to write a crawler for the Maoyan movie leaderboard. Environment setup and Scrapy installation are skipped here.

The first step, naturally, is to create the crawler project and spider file from the terminal.

# Create the crawler project
scrapy startproject Maoyan
cd Maoyan
# Create the spider file
scrapy genspider maoyan maoyan.com
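These two commands generate the standard Scrapy project layout shown below (maoyan.py is the spider file produced by genspider); all the files edited in the following steps live inside the inner Maoyan/ package:

Maoyan/
├── scrapy.cfg
└── Maoyan/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── maoyan.py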

Then define the data structure to be crawled in the generated items.py file.

import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()

Next, open maoyan.py and write the spider. Remember to import the MaoyanItem class from items.py and instantiate it.

import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    # The start_urls variable is removed

    # Override start_requests() to build the ten leaderboard page URLs
    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Base XPath: one dd node per movie
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # Visit each movie entry in turn
        for dd in dd_list:
            # Instantiate MaoyanItem from items.py and assign its class variables
            item = MaoyanItem()
            item['name'] = dd.xpath('./a/@title').get().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()

            # Hand the item object to the pipeline files
            yield item
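Before wiring these XPaths into the spider, it helps to verify them interactively. scrapy shell fetches a page and drops you into a Python prompt with the response object ready; a quick check might look like this (assuming the site serves the page to the default request headers):

scrapy shell "https://maoyan.com/board/4?offset=0"
>>> dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
>>> len(dd_list)          # expect 10 movies per page
>>> dd_list[0].xpath('./a/@title').get()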

Then define the pipeline file pipelines.py for persistent storage.

import pymysql
from .settings import *

class MaoyanPipeline(object):
    # item: the item data yielded from the spider file maoyan.py
    def process_item(self, item, spider):
        print(item['name'], item['time'], item['star'])

        return item


# Custom pipeline - MySQL database
class MaoyanMysqlPipeline(object):
    # Runs once when the crawler starts
    def open_spider(self, spider):
        print('output from open_spider()')
        # Typically used to open the database connection
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHAR
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into filmtab values(%s,%s,%s)'
        # execute() takes the query parameters as a sequence
        L = [
            item['name'], item['star'], item['time']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()

        return item

    # Runs once when the crawler finishes
    def close_spider(self, spider):
        print('output from close_spider()')
        # Typically used to close the database connection
        self.cursor.close()
        self.db.close()
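The insert statement above assumes a filmtab table already exists in maoyandb, but its schema is never shown in the post. A minimal one-time setup sketch that would satisfy the three-column insert (the column widths are guesses) is:

import pymysql

# Hypothetical schema - the post never shows it; adjust column sizes as needed
db = pymysql.connect(host='127.0.0.1', user='root', password='123456', charset='utf8')
cursor = db.cursor()
cursor.execute('create database if not exists maoyandb charset utf8')
cursor.execute(
    'create table if not exists maoyandb.filmtab('
    'name varchar(200), star varchar(500), time varchar(100))'
)
db.commit()
cursor.close()
db.close()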

Next comes editing the configuration file settings.py.

USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
    'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}
# Define the MySQL-related variables
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHAR = 'utf8'
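Note the numbers in ITEM_PIPELINES: lower values run first, so each item passes through MaoyanMysqlPipeline (200) before MaoyanPipeline (300), and each pipeline must return the item for the next one to receive it.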

Finally, create run.py, and the crawler is ready to run.

from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
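run.py should sit in the project root, next to scrapy.cfg; running python run.py there does exactly what typing scrapy crawl maoyan in the terminal would, but is more convenient to launch from an IDE.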


Original post: https://www.cnblogs.com/lattesea/p/11756552.html
