A Brief Look at HtmlParser

Date: 2015-01-26 20:57:14

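After crawling the pages you need with Heritrix, you still have to classify and otherwise process their content, and that is where htmlparser comes in. Using htmlparser is not all that easy, though: documentation is sparse, and many features are left for developers to explore and discover on their own.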

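Here is one fairly good resource, the htmlparser API documentation: http://tool.oschina.net/apidocs/apidoc?api=HTMLParser. It is only available in English, so readers less comfortable with English will need to push through it.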

The core of HTMLParser is the org.htmlparser.Parser class, which does the actual work of analyzing an HTML page. It has the following constructors:

public Parser ();
public Parser (Lexer lexer, ParserFeedback fb);
public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser (String resource, ParserFeedback feedback) throws ParserException;
public Parser (String resource) throws ParserException;
public Parser (Lexer lexer);
public Parser (URLConnection connection) throws ParserException;

and one static factory method:

public static Parser createParser (String html, String charset);

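For most users, the most common ways to initialize a Parser are from a URLConnection, from a string holding the page content, or through the static method. ParserFeedback is simple code aimed at debugging and tracing the parsing process, and generally does not need to be changed. Using a Lexer is a comparatively advanced topic and is left for later.

One interesting point: judging by the signatures above, if you need to specify the page encoding without using a Lexer, the static method is the only one that accepts a charset. For most Chinese pages, that probably makes it the method you will reach for most often.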
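To see why the charset argument matters, here is a minimal sketch that uses only the Java standard library, not HTMLParser. It decodes the same page bytes twice, once with the right charset and once with a wrong one. The class name CharsetDemo and the decode helper are made up for illustration, and ISO-8859-1 merely stands in for any mismatched charset (for Chinese pages the usual mix-up is GBK versus UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decode raw page bytes with the given charset, as any parser must do
    // before it can tokenize the markup.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "\u767e\u5ea6" is the two-character title used in index.html.
        String title = "<title>\u767e\u5ea6</title>";
        byte[] raw = title.getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(title)); // true: charset matches the bytes
        System.out.println(wrong.equals(title)); // false: each byte became its own char
    }
}
```

The markup itself survives the wrong decoding because it is plain ASCII; only the non-ASCII text is damaged, which is why a wrong charset often goes unnoticed until the extracted text is inspected.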

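Below is an example of initializing a Parser by opening the URL of a web page. The OpenFile method in the listing is meant for opening a local HTML file instead.

[Web page loaded: index.html]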

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "../image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>

[Source: htmlparser_1.java]

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.visitors.TextExtractingVisitor;

public class Main {
    // Charset used to read local files and to re-encode console output.
    private static final String ENCODE = "GBK";

    // Print a message after converting it from ENCODE to the platform encoding.
    // Note: this round trip only leaves the text unchanged when
    // file.encoding is the same charset as ENCODE.
    private static void message(String msg) {
        try {
            System.out.println(new String(msg.getBytes(ENCODE),
                    System.getProperty("file.encoding")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /*
     * Open a file and return its content ("" on failure).
     */
    public static String OpenFile(String FileName) {
        StringBuilder mContent = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader mBufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(FileName), ENCODE))) {
            String mTemp;
            while ((mTemp = mBufferedReader.readLine()) != null) {
                mContent.append(mTemp).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
        // Return the file content, not the file name.
        return mContent.toString();
    }

    /*
     * main method
     */
    public static void main(String[] args) {
        // String mContent = OpenFile("");
        try {
            Parser mParser = new Parser((HttpURLConnection) (new URL(
                    "http://127.0.0.1/HtmlParser/index.html")).openConnection());
            // Visit every node and accumulate the text found in the page.
            TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor();
            mParser.visitAllNodesWith(mExtractingVisitor);
            String textInPage = mExtractingVisitor.getExtractedText();
            message(textInPage);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Test output:



        百度









                    新闻


                    网页


                    贴吧


                    知道


                    音乐


                    图片


                    视频


                    地图






HTMLParser stores the parsed information as a tree. Node is the basic data type in which that information is held.

Here is the definition of Node:
public interface Node extends Cloneable;

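The methods on Node fall into several groups.

Methods for traversing the tree structure, which are the easiest to understand:

Node getParent (): get the parent node
NodeList getChildren (): get the list of child nodes
Node getFirstChild (): get the first child node
Node getLastChild (): get the last child node
Node getPreviousSibling (): get the previous sibling node
Node getNextSibling (): get the next sibling node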
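The traversal methods above can be illustrated with a small stand-in tree. This is only a sketch under stated assumptions: SimpleNode is a hypothetical class written for this example, not an HTMLParser type, and it copies just the method names from the list so the navigation pattern is visible without the library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class TraversalSketch {
    // Hypothetical stand-in for a parsed node; NOT an HTMLParser class.
    static class SimpleNode {
        final String name;
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name) { this.name = name; }

        // Attach a child and return this node so calls can be chained.
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        SimpleNode getParent() { return parent; }

        SimpleNode getFirstChild() {
            return children.isEmpty() ? null : children.get(0);
        }

        SimpleNode getLastChild() {
            return children.isEmpty() ? null : children.get(children.size() - 1);
        }

        SimpleNode getNextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }

        SimpleNode getPreviousSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i > 0 ? parent.children.get(i - 1) : null;
        }
    }

    // Depth-first walk that relies only on getFirstChild and getNextSibling.
    static void walk(SimpleNode node, int depth, List<String> out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");
        out.add(indent + node.name);
        for (SimpleNode c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    // A tiny tree shaped like the skeleton of index.html.
    static SimpleNode sample() {
        SimpleNode head = new SimpleNode("head").add(new SimpleNode("title"));
        SimpleNode body = new SimpleNode("body")
                .add(new SimpleNode("div"))
                .add(new SimpleNode("div"));
        return new SimpleNode("html").add(head).add(body);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        walk(sample(), 0, lines);
        for (String line : lines) System.out.println(line);
    }
}
```

With a real parse tree the same loop shape applies, except that whitespace between tags also shows up as nodes of its own.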

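Methods for getting the content of a Node:

String getText (): get the text
String toPlainTextString (): get the plain-text content
String toHtml (): get the HTML (the original HTML)
String toHtml (boolean verbatim): get the HTML (the original HTML)
String toString (): get a string representation (a debug-style dump of the node, as the test output further down shows, rather than the original HTML)
Page getPage (): get the Page object this Node belongs to
int getStartPosition (): get the start position of this Node in the HTML page
int getEndPosition (): get the end position of this Node in the HTML page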

Method used for Filter-based filtering:

void collectInto (NodeList list, NodeFilter filter): filter this node against the filter's condition and put the nodes that match into list.
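The idea behind collectInto can be sketched without the library. The example below is an assumption-laden illustration: SimpleNode is a hypothetical class, java.util.function.Predicate stands in for NodeFilter, and a plain List stands in for NodeList. It shows only the recursive "test this node, then descend into the children" pattern, not HTMLParser's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class CollectSketch {
    // Hypothetical stand-in for a parsed node; NOT an HTMLParser class.
    static class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode k : kids) children.add(k);
        }

        // Test this node first, then recurse so matches at any depth are collected.
        void collectInto(List<SimpleNode> list, Predicate<SimpleNode> filter) {
            if (filter.test(this)) list.add(this);
            for (SimpleNode child : children) child.collectInto(list, filter);
        }
    }

    // A cut-down version of the table in index.html: three cells, two links.
    static SimpleNode sample() {
        return new SimpleNode("table",
                new SimpleNode("td", new SimpleNode("a")),
                new SimpleNode("td", new SimpleNode("font")),
                new SimpleNode("td", new SimpleNode("a")));
    }

    public static void main(String[] args) {
        List<SimpleNode> links = new ArrayList<>();
        sample().collectInto(links, n -> n.name.equals("a"));
        System.out.println(links.size()); // 2
    }
}
```

Because the caller supplies the list, several collectInto calls with different filters can accumulate into the same result.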

Method used for Visitor-based traversal:

void accept (NodeVisitor visitor): apply the visitor to this Node
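The accept method is the entry point of the classic visitor pattern. The sketch below illustrates that pattern with hypothetical types (SimpleVisitor, TagNode, TextNode and TextCollector are all invented for this example); it is similar in spirit to what TextExtractingVisitor does in the first listing, but it is not the library's code.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {
    // Hypothetical visitor interface; NOT HTMLParser's NodeVisitor.
    interface SimpleVisitor {
        void visitTag(String name);
        void visitText(String text);
    }

    static abstract class SimpleNode {
        abstract void accept(SimpleVisitor visitor);
    }

    static class TextNode extends SimpleNode {
        final String text;
        TextNode(String text) { this.text = text; }
        @Override
        void accept(SimpleVisitor visitor) { visitor.visitText(text); }
    }

    static class TagNode extends SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        TagNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode k : kids) children.add(k);
        }

        // Report this tag, then pass the visitor on to every child in order.
        @Override
        void accept(SimpleVisitor visitor) {
            visitor.visitTag(name);
            for (SimpleNode child : children) child.accept(visitor);
        }
    }

    // Ignores tags and concatenates every piece of text it is shown.
    static class TextCollector implements SimpleVisitor {
        final StringBuilder text = new StringBuilder();
        @Override public void visitTag(String name) { }
        @Override public void visitText(String t) { text.append(t); }
    }

    static String extractText(SimpleNode root) {
        TextCollector collector = new TextCollector();
        root.accept(collector);
        return collector.text.toString();
    }

    public static void main(String[] args) {
        SimpleNode row = new TagNode("td",
                new TagNode("a", new TextNode("news")),
                new TextNode(" | "),
                new TagNode("font", new TextNode("web")));
        System.out.println(extractText(row)); // news | web
    }
}
```

The traversal order lives in the nodes, while the visitor decides what to do at each stop, so a new kind of extraction needs only a new visitor.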

Methods for modifying content, which are used less often:

void setPage (Page page): set the Page object this Node belongs to
void setText (String text): set the text
void setChildren (NodeList children): set the list of child nodes

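Other methods:

void doSemanticAction (): perform the action associated with this Node (only a few tags have one)
Object clone (): cloning support, declared because Node extends Cloneable

In practice HTMLParser is used mostly to process HTML pages, so the Filter- or Visitor-related methods are essential, and the first and second groups are the ones used most. The first group is easy to understand; the example below illustrates the second.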

[Source: htmlparser_2.java]

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;

public class Main {
    // Charset used to read local files and to re-encode console output.
    private static final String ENCODE = "utf-8";

    // Print a message after converting it from ENCODE to the platform encoding.
    // Note: this round trip only leaves the text unchanged when
    // file.encoding is the same charset as ENCODE.
    private static void message(String msg) {
        try {
            System.out.println(new String(msg.getBytes(ENCODE),
                    System.getProperty("file.encoding")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /*
     * Open a file and return its content ("" on failure).
     */
    public static String OpenFile(String FileName) {
        StringBuilder mContent = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader mBufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(FileName), ENCODE))) {
            String mTemp;
            while ((mTemp = mBufferedReader.readLine()) != null) {
                mContent.append(mTemp).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
        // Return the file content, not the file name.
        return mContent.toString();
    }

    /*
     * main method
     */
    public static void main(String[] args) {
        // String mContent = OpenFile("");
        try {
            Parser mParser = new Parser((HttpURLConnection) (new URL(
                    "http://127.0.0.1/HtmlParser/index.html")).openConnection());

            // Iterate over the top-level nodes and print each content accessor.
            for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) {
                Node node = i.nextNode();
                message("getText:" + node.getText());
                message("getPlainText:" + node.toPlainTextString());
                message("toHtml:" + node.toHtml());
                message("toHtml(true):" + node.toHtml(true));
                message("tohtml(false):" + node.toHtml(false));
                message("toString:" + node.toString());
                message("==============================");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Test output:

getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
tohtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
==============================
getText:

getPlainText:

toHtml:

toHtml(true):

tohtml(false):

toString:Txt (121[0,121],123[1,0]): \n
==============================
getText:html
getPlainText:


        百度









                    新闻


                    网页


                    贴吧


                    知道


                    音乐


                    图片


                    视频


                    地图







toHtml:<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "../image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
toHtml(true):<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "../image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
tohtml(false):<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "../image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
toString:Tag (123[1,0],129[1,6]): html
  Txt (129[1,6],132[2,1]): \n\t
  Tag (132[2,1],138[2,7]): head
    Txt (138[2,7],142[3,2]): \n\t\t
    Tag (142[3,2],216[3,76]): meta http-equiv = "Content-Type" content = "text/ht...
    Txt (216[3,76],220[4,2]): \n\t\t
    Tag (220[4,2],227[4,9]): title
      Txt (227[4,9],229[4,11]): 百度
      End (229[4,11],237[4,19]): /title
    Txt (237[4,19],241[5,2]): \n\t\t
    Tag (241[5,2],302[5,63]): link href = "a_1.css" rel = "stylesheet" type = "te...
    Txt (302[5,63],305[6,1]): \n\t
    End (305[6,1],312[6,8]): /head
  Txt (312[6,8],315[7,1]): \n\t
  Tag (315[7,1],321[7,7]): body
    Txt (321[7,7],325[8,2]): \n\t\t
    Tag (325[8,2],365[8,42]): div  align = "center" class = "photo" 
      Txt (365[8,42],370[9,3]): \n\t\t\t
      Tag (370[9,3],403[9,36]): img src = "../image/baidu.PNG" 
      Txt (403[9,36],407[10,2]): \n\t\t
      End (407[10,2],413[10,8]): /div
    Txt (413[10,8],417[11,2]): \n\t\t
    Tag (417[11,2],454[11,39]): div align = "center" class = "body"
      Txt (454[11,39],459[12,3]): \n\t\t\t
      Tag (459[12,3],482[12,26]): table cellpadding="8"
        Txt (482[12,26],488[13,4]): \n\t\t\t\t
        Tag (488[13,4],492[13,8]): td
          Txt (492[13,8],499[14,5]): \n\t\t\t\t\t
          Tag (499[14,5],552[14,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (552[14,58],554[14,60]): 新闻
            End (554[14,60],558[14,64]): /a
          Txt (558[14,64],564[15,4]): \n\t\t\t\t
          End (564[15,4],569[15,9]): /td
        Txt (569[15,9],575[16,4]): \n\t\t\t\t
        Tag (575[16,4],579[16,8]): td
          Txt (579[16,8],586[17,5]): \n\t\t\t\t\t
          Tag (586[17,5],608[17,27]): font color = "black"
          Txt (608[17,27],610[17,29]): 网页
          End (610[17,29],617[17,36]): /font
          Txt (617[17,36],623[18,4]): \n\t\t\t\t
          End (623[18,4],628[18,9]): /td
        Txt (628[18,9],634[19,4]): \n\t\t\t\t
        Tag (634[19,4],638[19,8]): td
          Txt (638[19,8],645[20,5]): \n\t\t\t\t\t
          Tag (645[20,5],698[20,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (698[20,58],700[20,60]): 贴吧
            End (700[20,60],704[20,64]): /a
          Txt (704[20,64],710[21,4]): \n\t\t\t\t
          End (710[21,4],715[21,9]): /td
        Txt (715[21,9],721[22,4]): \n\t\t\t\t
        Tag (721[22,4],725[22,8]): td
          Txt (725[22,8],732[23,5]): \n\t\t\t\t\t
          Tag (732[23,5],785[23,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (785[23,58],787[23,60]): 知道
            End (787[23,60],791[23,64]): /a
          Txt (791[23,64],797[24,4]): \n\t\t\t\t
          End (797[24,4],802[24,9]): /td
        Txt (802[24,9],808[25,4]): \n\t\t\t\t
        Tag (808[25,4],812[25,8]): td
          Txt (812[25,8],819[26,5]): \n\t\t\t\t\t
          Tag (819[26,5],872[26,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (872[26,58],874[26,60]): 音乐
            End (874[26,60],878[26,64]): /a
          Txt (878[26,64],884[27,4]): \n\t\t\t\t
          End (884[27,4],889[27,9]): /td
        Txt (889[27,9],895[28,4]): \n\t\t\t\t
        Tag (895[28,4],899[28,8]): td
          Txt (899[28,8],906[29,5]): \n\t\t\t\t\t
          Tag (906[29,5],959[29,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (959[29,58],961[29,60]): 图片
            End (961[29,60],965[29,64]): /a
          Txt (965[29,64],971[30,4]): \n\t\t\t\t
          End (971[30,4],976[30,9]): /td
        Txt (976[30,9],982[31,4]): \n\t\t\t\t
        Tag (982[31,4],986[31,8]): td
          Txt (986[31,8],993[32,5]): \n\t\t\t\t\t
          Tag (993[32,5],1046[32,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (1046[32,58],1048[32,60]): 视频
            End (1048[32,60],1052[32,64]): /a
          Txt (1052[32,64],1058[33,4]): \n\t\t\t\t
          End (1058[33,4],1063[33,9]): /td
        Txt (1063[33,9],1069[34,4]): \n\t\t\t\t
        Tag (1069[34,4],1073[34,8]): td
          Txt (1073[34,8],1080[35,5]): \n\t\t\t\t\t
          Tag (1080[35,5],1133[35,58]): a href = "#" target = _blank title = "欢迎来到&#10百...
            Txt (1133[35,58],1135[35,60]): 地图
            End (1135[35,60],1139[35,64]): /a
          Txt (1139[35,64],1145[36,4]): \n\t\t\t\t
          End (1145[36,4],1150[36,9]): /td
        Txt (1150[36,9],1155[37,3]): \n\t\t\t
        End (1155[37,3],1163[37,11]): /table
      Txt (1163[37,11],1168[38,3]): \n\t\t\t
      Tag (1168[38,3],1192[38,27]): input class = "input" 
      Txt (1192[38,27],1196[39,2]): \n\t\t
      End (1196[39,2],1202[39,8]): /div
    Txt (1202[39,8],1205[40,1]): \n\t
    End (1205[40,1],1212[40,8]): /body
  Txt (1212[40,8],1216[42,0]): \n\n
  End (1216[42,0],1223[42,7]): /html

==============================

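The first Node corresponds to the first line of the page, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">. The output also shows that the content is organized as a tree, or rather as a forest: each top-level item in the page, here the DOCTYPE and html, forms its own top-level Node. (Many people may find the second Node puzzling. It is nothing more than a line break. HTMLParser keeps the line breaks, spaces, and tabs in the page as text nodes of their own, shown as Txt in the toString output, which is why such a Node appears. Small in content but high in rank.)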

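toPlainTextString includes everything the user can see. Two things are interesting here. First, the title inside the <head> tag shows up in the plain text; perhaps being visible in the title bar counts as visible. Second, as mentioned above, the line breaks and other whitespace in the HTML also end up in the plain text, which seems logically questionable.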

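You may also have noticed that toHtml(), toHtml(true) and toHtml(false) produce the same results. That is indeed the case. Tracing through the HTMLParser source shows that AbstractNode, which implements Node, implements toHtml() by calling toHtml(false) directly, and that the toHtml(boolean verbatim) implementations in AbstractNode's three subclasses, RemarkNode, TagNode and TextNode, do not act on the verbatim parameter, so all three calls return exactly the same thing. Unless you need some special handling of your own, plain toHtml() is enough.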
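The delegation described above can be pictured with a short sketch. The classes here are hypothetical stand-ins (BaseNode and TextLike are invented names) that mirror only the structure the paragraph describes: a no-argument toHtml() forwarding to toHtml(false), and a subclass that never reads the flag. It is not a copy of the HTMLParser source.

```java
public class ToHtmlSketch {
    // Hypothetical base class playing the role the text assigns to AbstractNode.
    static abstract class BaseNode {
        // The no-argument form simply forwards with verbatim = false.
        String toHtml() {
            return toHtml(false);
        }

        abstract String toHtml(boolean verbatim);
    }

    // Hypothetical subclass that, like the ones described, ignores the flag.
    static class TextLike extends BaseNode {
        final String raw;

        TextLike(String raw) { this.raw = raw; }

        @Override
        String toHtml(boolean verbatim) {
            return raw; // verbatim is never consulted
        }
    }

    public static void main(String[] args) {
        BaseNode node = new TextLike("<td>x</td>");
        // All three calls go through the same code path, so they agree.
        System.out.println(node.toHtml().equals(node.toHtml(true)));
        System.out.println(node.toHtml().equals(node.toHtml(false)));
    }
}
```

A subclass that did honor verbatim would make the three results diverge, which is the place to hook in any special handling of your own.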

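The inheritance hierarchy of HTMLParser's Node classes is shown in the figure below (copied from another article).

[Figure: Node class inheritance diagram]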

Original article: http://www.cnblogs.com/zhjsll/p/4251153.html
