xsoup

Posted: 2016-05-11 13:32:27

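Xsoup 0.2.0 released: an HTML extractor

Posted by 黄亿华 on 2014-03-11

Xsoup is a tool built on Jsoup that extracts HTML elements using XPath. It is used in the author's crawler framework, WebMagic, for XPath parsing and extraction.

This update mainly adds support for more XPath syntax.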

Added contains support (#2):

//div[contains(@id,'test')]

Added logical operators (and/or) in predicates (#4):

//div[@id='test' or @class='test']
//div[@id='test' and @class='test']
//div[@id='test' and @class='test' or @id='test1']
//div[@id='test' and (@class='test' or @id='test1')]

Added union (or) support across whole XPath expressions (#6):

//div[@id='test']/text() | //div[@class='test']/div/text()
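The constructs added in this release (contains, and/or inside a predicate, and the top-level | union) are all standard XPath 1.0, so their semantics can be sketched without Xsoup. The following is a minimal illustration using the JDK's built-in javax.xml.xpath engine on a small made-up XML document. The class name, the count helper, and the sample markup are hypothetical, and the sketch assumes Xsoup evaluates these expressions the same way standard XPath does, which the announcement does not state explicitly.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathSemanticsDemo {
    // Hypothetical sample document, well-formed so the JDK XML parser accepts it.
    static final String XML =
          "<root>"
        + "<div id='test' class='test'>A</div>"
        + "<div id='test1'>B</div>"
        + "<div id='mytest2' class='other'>C</div>"
        + "<div class='test'><div>D</div></div>"
        + "</root>";

    // Returns how many nodes the expression selects in the sample document.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//div[contains(@id,'test')]",
            "//div[@id='test' or @class='test']",
            "//div[@id='test' and @class='test']",
            // In XPath 1.0, "and" binds tighter than "or", so this reads as
            // (@id='test' and @class='test') or @id='test1'.
            "//div[@id='test' and @class='test' or @id='test1']",
            // Parentheses change the grouping, so id='test1' alone no longer matches.
            "//div[@id='test' and (@class='test' or @id='test1')]",
            // Union of two node sets.
            "//div[@id='test']/text() | //div[@class='test']/div/text()"
        };
        for (String expression : expressions) {
            System.out.println(count(expression) + "  " + expression);
        }
    }
}
```

The fourth and fifth expressions differ only in grouping, which is why the parenthesized form is worth having: without parentheses the element with id 'test1' is selected, with them it is not.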

This release is API-compatible with Xsoup 0.1.0. Users of WebMagic 0.3.0 or later can use the new syntax simply by adding the dependency to their project.

<dependency>
  <groupId>us.codecraft</groupId>
  <artifactId>xsoup</artifactId>
  <version>0.2.0</version>
</dependency>
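With that dependency (and Jsoup, which it builds on) on the classpath, a query might look like the sketch below. It uses Xsoup.compile(...).evaluate(...).list() as documented in the project's README; the class name, the select helper, and the sample HTML are hypothetical and only for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

final class XsoupQuickStart {
    // Hypothetical sample page.
    static final String HTML =
          "<html><body>"
        + "<div id='test'>first</div>"
        + "<div class='test'><div>second</div></div>"
        + "</body></html>";

    // Parses the sample page with Jsoup and returns every match as a string.
    static List<String> select(String xpath) {
        Document document = Jsoup.parse(HTML);
        return Xsoup.compile(xpath).evaluate(document).list();
    }

    public static void main(String[] args) {
        // Plain attribute predicate.
        System.out.println(select("//div[@id='test']/text()"));
        // Union syntax introduced in 0.2.0.
        System.out.println(select("//div[@id='test']/text() | //div[@class='test']/div/text()"));
    }
}
```

Compiling the expression once and reusing the evaluator across documents is likely the better pattern in a crawler; the helper above recompiles on every call only to keep the sketch short.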

Related links

Xsoup download: https://github.com/code4craft/xsoup
Xsoup questions on OSChina: http://www.oschina.net/question/tag/xsoup?show=hot

Why can't Jsoup parse the <td> elements under a <tr>? http://www.oschina.net/question/1271820_131887

After obtaining the <td></td> fragment, wrap it in <table></table> before parsing it.
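The reason the wrapper helps is that an HTML parser drops a td start tag that appears outside a table context, so the cell text survives but the td elements do not. A minimal sketch with Jsoup follows; the class and helper names are hypothetical.

```java
import org.jsoup.Jsoup;

final class TableFragmentDemo {
    // Counts the <td> elements Jsoup produces for the given markup.
    static int countCells(String html) {
        return Jsoup.parse(html).select("td").size();
    }

    // Wraps a bare row or cell fragment so it is parsed in a table context.
    static String wrapInTable(String fragment) {
        return "<table>" + fragment + "</table>";
    }

    public static void main(String[] args) {
        String fragment = "<td>a</td><td>b</td>";
        // Parsed on its own, the td tags are discarded.
        System.out.println("bare fragment: " + countCells(fragment));
        // Inside <table>, the parser keeps the cells and supplies the implied tbody and tr.
        System.out.println("wrapped:       " + countCells(wrapInTable(fragment)));
    }
}
```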

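Xsoup is an XPath parser built on Jsoup.

WebMagic previously used HtmlCleaner as its parser, which caused some problems in practice. The main ones were that XPath errors were not located precisely and that its rather awkward code structure made customization difficult. Xsoup was written to address this. Xsoup is more than twice as fast as HtmlCleaner.

Xsoup now supports the syntax commonly needed by crawlers. The table below lists what is supported: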

Name	Expression	Support
nodename	nodename	yes
immediate parent	/	yes
parent	//	yes
attribute	[@key=value]	yes
nth child	tag[n]	yes
attribute	/@key	yes
wildcard in tagname	/*	yes
wildcard in attribute	/[@*]	yes
function	function()	partial
or	a | b	yes since 0.2.0
parent in path	. or ..	no
predicates	price>35	no
predicates logic	@class=a or @class=b	yes since 0.2.0

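Xsoup also defines several convenient XPath functions. Note, however, that these functions are not part of standard XPath.

Expression	Description	XPath1.0
text(n)	the nth direct text child node; 0 means all of them	text() only
allText()	all direct and indirect text child nodes	not supported
tidyText()	all direct and indirect text child nodes, with some tags replaced by line breaks so the plain text reads more cleanly	not supported
html()	inner HTML, excluding the element's own tag	not supported
outerHtml()	HTML of the element, including the element's own tag	not supported
regex(@attr,expr,group)	regular-expression extraction; @attr and group are both optional, and group defaults to 0	not supported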
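The group argument of regex() is easiest to read in terms of ordinary capture groups. The sketch below does not call Xsoup at all; it uses java.util.regex directly and assumes that Xsoup's regex() follows the same group numbering (group 0 is the whole match, group 1 the first parenthesized group, and so on). The class name, the extract helper, and the sample pattern are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexGroupDemo {
    // Returns the requested capture group of the first match, or null if nothing matches.
    static String extract(String input, String expr, int group) {
        Matcher matcher = Pattern.compile(expr).matcher(input);
        return matcher.find() ? matcher.group(group) : null;
    }

    public static void main(String[] args) {
        String href = "https://github.com/code4craft/xsoup";
        String expr = "github\\.com/(\\w+)/(\\w+)";
        System.out.println(extract(href, expr, 0)); // whole match
        System.out.println(extract(href, expr, 1)); // first capture group
        System.out.println(extract(href, expr, 2)); // second capture group
    }
}
```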

Original article: http://www.cnblogs.com/destim/p/5481461.html