Python library: Scrapy (a deep pit, not yet filled in)

Date: 2017-10-22 11:12:42


Scrapy: a fast, high-level framework for screen scraping and web crawling.

http://scrapy.org/    Official website

https://docs.scrapy.org/en/latest/    Documentation

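Installation: installing Scrapy on Windows 7 (2017-10-19)

Current environment: Windows 7, Python 3.6.0, PyCharm 4.5. The Python directory is c:/python3/

Scrapy depends on quite a few libraries. At a minimum it needs Twisted 14.0, lxml 3.4 and pyOpenSSL 0.14.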
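Before installing, it can help to see which of these dependencies are already present. The following is only a sketch: it assumes Python 3.8 or newer for importlib.metadata (the Python 3.6.0 used in these notes would need the importlib_metadata backport instead), the helper names as_tuple and check are made up for illustration, and the version comparison is deliberately rough. The minimum versions are the ones quoted above.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in the notes above.
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def as_tuple(ver):
    """Turn '17.9.0' into (17, 9, 0), padded to three parts (rough comparison only)."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric piece, e.g. 'dev1'
        parts.append(int(digits))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check(minimums=MINIMUMS):
    """Return {name: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, as_tuple(installed) >= as_tuple(minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check().items():
        print(name, installed or "not installed", "OK" if ok else "needs attention")
```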

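Reference article: http://www.cnblogs.com/liuliliuli2017/p/6746440.html    (Installing the Scrapy crawler framework under Python 3, and common errors)

I ran into a problem while installing Twisted. These are the steps that solved it:

1. Download the .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (important: this site hosts a great many whl files!)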

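My machine runs 64-bit Windows 7, so in principle I should have used Twisted-17.9.0-cp36-cp36m-win_amd64.whl, but pip refused to install it. On a blind guess I downloaded Twisted-17.9.0-cp36-cp36m-win32.whl instead and saved it as C:\Python3\Scripts\Twisted-17.9.0-cp36-cp36m-win32.whl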
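A wheel's platform tag has to match the Python interpreter rather than the operating system, so a 32-bit Python on 64-bit Windows needs the win32 wheel. That would explain what happened here, although the actual error message was not recorded, so treat it as a likely cause rather than a confirmed one. A minimal sketch for checking it (the helper names are hypothetical, and only the two Windows tags mentioned above are handled):

```python
import struct


def interpreter_bits():
    """Pointer size of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def wheel_platform_tag(filename):
    """Last dash-separated field of a wheel filename, e.g. 'win32' or 'win_amd64'."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    return stem.split("-")[-1]


def matches_interpreter(filename, bits=None):
    """True if a Windows wheel's platform tag fits the given interpreter bitness."""
    bits = interpreter_bits() if bits is None else bits
    expected = "win_amd64" if bits == 64 else "win32"
    return wheel_platform_tag(filename) == expected


if __name__ == "__main__":
    print("This Python is", interpreter_bits(), "bit")
```

If interpreter_bits() reports 32 on the machine in question, the win32 wheel is the right choice even though Windows itself is 64-bit.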

Run, from C:\Python3\Scripts: python pip3.exe install Twisted-17.9.0-cp36-cp36m-win32.whl

Then run: python pip.exe install scrapy    and Scrapy installed successfully.

Learning notes:

cd c:\Python3\zz\    # C:\Python3\zz\ is the folder where I keep my projects
python C:/Python3/Scripts/scrapy.exe startproject plant    # create a crawler project named plant

C:\Python3\zz\plant\

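├ scrapy.cfg:    the project's configuration file
├ plant/:    the project's Python module; your code goes in here later
├ plant/items.py:    the project's item definitions
├ plant/pipelines.py:    the project's pipelines
├ plant/settings.py:    the project's settings
└ plant/spiders/:    the directory that holds the spider code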
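To confirm that startproject really produced this layout, a quick check along these lines may be useful. It is a sketch only: the function name missing_entries is made up, and the list covers just the entries shown in the tree above, not everything Scrapy generates.

```python
from pathlib import Path

# Entries listed in the tree above, relative to the project root.
EXPECTED = [
    "scrapy.cfg",
    "plant",
    "plant/items.py",
    "plant/pipelines.py",
    "plant/settings.py",
    "plant/spiders",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Path taken from the notes; adjust to wherever the project was created.
    print(missing_entries(r"C:\Python3\zz\plant") or "layout looks complete")
```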

Edit items.py

import scrapy
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

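Write the first spider: create the file C:\Python3\zz\plant\plant\spiders\quotes_spider.py

The next two steps follow the tutorial at https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project , but they raise an error on this machine. I will try again tomorrow.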

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
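The file name in parse() comes from splitting the URL on "/" and taking the second-to-last piece. The standalone sketch below reproduces just that step without Scrapy, so it can be tried in a plain interpreter. The function name page_filename is made up for illustration.

```python
def page_filename(url):
    """Mirror the spider's parse(): use the second-to-last '/'-separated piece of the URL."""
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in ("http://quotes.toscrape.com/page/1/", "http://quotes.toscrape.com/page/2/"):
    print(url, "->", page_filename(url))
```

This relies on the trailing slash: the last piece after splitting is then an empty string, so index -2 is the page number. Without the trailing slash, index -2 would be "page" instead.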

Go into the project folder and run:

cd c:\Python3\zz\plant
scrapy crawl quotes

....


Original article: http://www.cnblogs.com/qq21270/p/7707604.html
