Common methods for extracting the main text of a web page

Date: 2017-09-04 09:46:25

Using Python readability:

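from urllib.request import urlopen

from readability.readability import Document

# Same example page as in the Newspaper and Goose sections below.
url = 'http://www.chinanews.com/gj/2014/11-19/6791729.shtml'
html = urlopen(url).read()

# Parse once and reuse the document for both the body and the title.
doc = Document(html)
readable_article = doc.summary()     # main content, still HTML markup
readable_title = doc.short_title()   # shortened page title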

The extracted readable_article is still text with HTML tags, so it needs an HTML cleanup pass. Getting plain text requires an additional step.

For example, to extract the body as plain text with Scrapy:

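from scrapy.http import HtmlResponse

# Wrap the readability output in a response object so it can be queried with XPath.
response = HtmlResponse(url='', body=readable_article, encoding='utf8')

# Join every text node and trim the surrounding whitespace.
html_content = ''.join(response.xpath('//text()').getall()).strip()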

However, in many cases this approach fails to extract the body text.
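The tag-stripping step can also be sketched with the standard library alone, which may be handy when Scrapy is not installed. This is a minimal illustration rather than part of the original post: the names TextExtractor and html_to_text are hypothetical, and skipping script and style content is an added assumption. Like the Scrapy snippet, it concatenates the text nodes and trims the result.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks).strip()


def html_to_text(fragment):
    """Return the concatenated text nodes of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling html_to_text(readable_article) on the readability output would give a plain-text body. It does nothing to improve the pages where readability itself picks the wrong block.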

Using Python Newspaper:

Newspaper: this library handles everything from downloading the page to parsing it, all in one package.

The core example code is as follows:

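from newspaper import Article

a = Article('http://www.chinanews.com/gj/2014/11-19/6791729.shtml', language='zh')
a.download()          # fetch the page
a.parse()             # extract the article fields
print(a.text[:150])   # first 150 characters of the extracted body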

Result: it is fairly slow (the first run takes about 4 s), and the extraction quality is only average.
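A timing figure like the one above can be reproduced with a small helper. The sketch below is only an illustration: timed and time_repeated are hypothetical names, not part of Newspaper, and the post does not say how its 4 s figure was measured.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def time_repeated(func, runs=3):
    """Time several consecutive calls so the first (cold) run can be
    compared with the later ones."""
    return [timed(func)[1] for _ in range(runs)]
```

Wrapping the download() and parse() calls in a zero-argument function and passing it to time_repeated would show whether only the first run is slow.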

Using Python Goose:

The code is convenient to write, but some URLs could not be parsed.

Example code:

from goose import Goose
from goose.text import StopWordsChinese

url = 'http://www.chinanews.com/gj/2014/11-19/6791729.shtml'
# Use the Chinese stop-word list so Chinese text is scored correctly.
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print(article.cleaned_text[:150])   # first 150 characters of the extracted body

Result: not good; some URLs cannot be parsed at all.
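Since each of the three libraries fails on some pages, one possible workaround is to try them in turn and keep the first usable result. The following is a sketch under assumptions, not something the original post does: first_successful is a hypothetical helper, the extractors are plain callables that you would wrap around the snippets above, and the min_chars threshold of 50 is an arbitrary choice.

```python
def first_successful(extractors, page, min_chars=50):
    """Try (name, function) pairs in order and return (name, text)
    for the first usable result, or (None, "") if none succeeds.

    A result counts as usable when the extractor does not raise and
    returns at least min_chars characters after stripping whitespace.
    """
    for name, extract in extractors:
        try:
            text = (extract(page) or "").strip()
        except Exception:
            # Treat any failure as "this extractor could not parse the page".
            continue
        if len(text) >= min_chars:
            return name, text
    return None, ""
```

The returned name records which extractor produced the text, which makes it easier to see afterwards how often each library was needed.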

Original article: http://www.cnblogs.com/zhaobang/p/7472091.html