[scrapy] Using goose for article body extraction in Scrapy

Date: 2015-08-25 19:29:45

import scrapy
from goose import Goose

class Article(scrapy.Item):
    title = scrapy.Field()
    text = scrapy.Field()

class MyGooseSpider(scrapy.Spider):
    name = 'goose'
    start_urls = [
        'http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/',
        'http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/',
    ]

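    def parse(self, response):
        # Hand the HTML that Scrapy already downloaded to goose via raw_html,
        # so goose does not fetch the URL a second time.
        article = Goose().extract(raw_html=response.body)
        # cleaned_text is the extracted article body as plain text.
        yield Article(title=article.title, text=article.cleaned_text)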
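
The spider above builds a new Goose() for every response. A minimal sketch of one way to share a single extractor across pages follows, assuming the extractor is safe to reuse between calls (the source does not say so). LazyArticleExtractor and extractor_factory are hypothetical names introduced here for illustration; they are not part of scrapy or goose. The stand-in classes at the bottom exist only so the sketch runs without goose installed.

```python
class LazyArticleExtractor:
    """Builds the underlying extractor once and reuses it for every page.

    `extractor_factory` is a hypothetical hook: any zero-argument callable
    returning an object with extract(raw_html=...) whose result exposes
    `title` and `cleaned_text`, as Goose() does in the spider above.
    """

    def __init__(self, extractor_factory):
        self._factory = extractor_factory
        self._extractor = None  # created on first use

    def extract(self, raw_html):
        # Create the extractor lazily, then keep it for later calls.
        if self._extractor is None:
            self._extractor = self._factory()
        article = self._extractor.extract(raw_html=raw_html)
        return {"title": article.title, "text": article.cleaned_text}


if __name__ == "__main__":
    # Stand-in objects (assumptions for the demo, not the goose API itself).
    class _FakeArticle:
        title = "Example title"
        cleaned_text = "Example body"

    class _FakeGoose:
        def extract(self, raw_html):
            return _FakeArticle()

    extractor = LazyArticleExtractor(_FakeGoose)
    print(extractor.extract("<html></html>"))
```

In the spider, this could be wired up as a class attribute such as extractor = LazyArticleExtractor(Goose), with parse() yielding Article(**self.extractor.extract(response.body)). Whether this is worthwhile depends on how costly constructing Goose() is in practice.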

Reposted from: http://stackoverflow.com/questions/26940002/can-i-use-scrapy-with-goose

Original article: http://www.cnblogs.com/bushe/p/4757981.html
