Scrapy Crawler -- 01

Date: 2014-10-02 14:32:53

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages.

--from wiki

Put simply, it is a crawler framework built on Python.

Installation:

ubuntu 14.04
python2.7 (Python 3 is not supported, and not out of laziness: Twisted, which the Scrapy framework depends on, has not been fully ported to Python 3 yet)
pip

sudo pip2 install scrapy

Note: pip3 can also install scrapy, but supporting libraries are missing and it will not run, so stick with Python 2.

Usage:

1. Create a new project named tutorial

scrapy startproject tutorial

This creates the following directory structure:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ..
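
As a quick sanity check, the layout above can be compared against what is actually on disk. This is only a minimal sketch using the standard library: the helper name missing_project_files and the EXPECTED_PATHS list are made up for illustration (they are not part of Scrapy), and the list contains only the files named in the tree above.

```python
import os

# Paths copied from the directory tree above, relative to the project root.
# This list is an assumption for the sketch, not something Scrapy provides.
EXPECTED_PATHS = [
    "scrapy.cfg",
    "tutorial/__init__.py",
    "tutorial/items.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders/__init__.py",
]


def missing_project_files(root):
    """Return the expected paths that do not exist under root."""
    missing = []
    for relative in EXPECTED_PATHS:
        full = os.path.join(root, *relative.split("/"))
        if not os.path.exists(full):
            missing.append(relative)
    return missing
```

Calling missing_project_files("tutorial") from the directory where the startproject command was run should return an empty list if every file in the tree was generated.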

The official site explains these as follows:

scrapy.cfg: the project configuration file
tutorial/: the project’s python module, you’ll later import your code from here. (This is the project's own package, where your custom code goes.)
tutorial/items.py: the project’s items file. (In effect, this defines the structure of the data to be scraped.)
tutorial/pipelines.py: the project’s pipelines file. (This is where you define how scraped data is exported; a scrapy-mongodb pipeline is available through pip that exports scraped data directly to MongoDB.)
tutorial/settings.py: the project’s settings file.
tutorial/spiders/: a directory where you’ll later put your spiders. (The spiders stored here are generally what fetch the web pages.)
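
To make the role of pipelines.py more concrete, here is a minimal sketch of an export pipeline. It is not the scrapy-mongodb pipeline mentioned above: it writes JSON lines to a local file instead, so that it needs nothing beyond the standard library. The class name JsonLinesPipeline and the output file name items.jl are assumptions chosen for this example; the method names process_item, open_spider and close_spider are the hooks Scrapy calls on a pipeline object.

```python
import json


class JsonLinesPipeline(object):
    """Write each scraped item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # The default output path is an assumption made for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the output file.
        self.file = open(self.path, "w")

    def close_spider(self, spider):
        # Called when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) accepts plain dicts as well as dict-like item objects.
        self.file.write(json.dumps(dict(item)) + "\n")
        # Return the item so that any later pipeline still receives it.
        return item
```

A class like this would live in tutorial/pipelines.py and is switched on through the ITEM_PIPELINES setting in tutorial/settings.py. It is a plain Python class, so it can also be exercised by hand with an ordinary dict as the item.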

To be continued...


Original article: http://my.oschina.net/u/1242185/blog/323754