The most basic part of a crawler is downloading pages; the most important part is filtering -- extracting the information we need.
scrapy provides exactly this functionality.
First, we define the items:
Items are containers that will be loaded with the scraped data; they work like simple python dicts but provide additional protection against populating undeclared fields, to prevent typos.
Quoted from the official docs; the gist is that items are the data structures used to store the scraped data, and that they provide extra protection compared with plain python dicts (a short sketch of that protection follows the example below).
An example:
project/items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
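To see that protection in action, here is a minimal sketch (the field values are made up for illustration):

item = DmozItem(title='Sample book list', link='http://example.com')
item['desc'] = 'a short description'  # declared field: behaves like a dict entry
item['author'] = 'someone'            # undeclared field: raises KeyError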
Next we write a Spider that crawls the pages, selects the information, and puts it into the items.
An example:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                   # must be unique within the project
    allowed_domains = ["dmoz.org"]  # links outside these domains are not followed
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # save each downloaded page under the second-to-last segment of its URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Notes:
name: the Spider's name, which must be unique within the project; the reason will become clear in a moment.
allowed_domains: the domain setting, i.e. which domains may be crawled; normally just set it to the domains of the addresses in start_urls. (A sketch of a parse() that actually fills the items follows these notes.)
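The parse() above only saves raw pages; to put data into the items defined earlier, parse() can use selectors and yield items instead. A minimal sketch, assuming the project is named tutorial and that the XPath expressions match the dmoz.org list layout (both are assumptions):

import scrapy
from tutorial.items import DmozItem  # "tutorial" project name is an assumption

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # '//ul/li' is an assumption about the page layout
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item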
Now launch the spider:
scrapy crawl dmoz
If all goes well, the output looks like this:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
Since parse() saves each page under the second-to-last segment of its URL, the crawl leaves two files, Books and Resources, in the current directory.
To be continued.
Original article: http://my.oschina.net/u/1242185/blog/324256