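Part of this post is based on other blogs I found online, but it has been so long that I no longer remember whose they were. My apologies.
First, here is a sample HTML page: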
<html> <head> <meta http-equiv="content-type" content="text/html;charset=GBK"> <title>HTML Parser</title> <meta name="generator" content="Namo WebEditor"> </head> <body> <table width=620 border=0 cellpadding=1 cellspacing=0 bgcolor=#0066cc> <tr> <td width=100%> <table width=100% border=0 cellpadding=4 cellspacing=0 bgcolor=#D3E5FB> <tr bgcolor=#D3E5FB> <td width=20%><font size="2" face="Arial,Verdana"><b>想学习 Name</b></font><br> </td> <td width=13%><font size="2" face="Arial,Verdana"><b>Result</b></font><br> </td> <td width=8%><font size="2" face="Arial,Verdana"><b>Time</b></font><br> </td> <td width=59%><font size="2" face="Arial,Verdana"><b>Synopsis</b></font><br> </td> </tr> <tr bgcolor=#eeeeee> <td width=20%><font size="1" face="Arial,Verdana"><b>9</b> 想学习</font><br> </td> <td width=13%><font size="1" face="Arial,Verdana"><font color=#ff0033>+FAIL</font> <a href="v4_wireless_802.1x_full/cdrouter_dhcp_20.txt">想学习</a></font><br> </td> <td width=8%><font size="1" face="Arial,Verdana">12:31</font><br> </td> <td width=59%><font size="1" face="Arial,Verdana">想学习</font><br> </td> </tr> <tr bgcolor=#ffffff> <td width=20%><font size="1" face="Arial,Verdana"><b>1</b> cdrouter_basic_1</font><br> </td> <td width=13%><font size="1" face="Arial,Verdana">Pass <a href="v4_wireless_802.1x_full/cdrouter_basic_1.txt">想学习</a></font><br> </td> <td width=8%><font size="1" face="Arial,Verdana">00:00</font><br> </td> <td width=59%><font size="1" face="Arial,Verdana">想学习</font><br> </td> </tr> </table> </td> </tr> </table> </body> </html>
After searching online I found the jericho-html-3.3 library, which turns out to be very convenient for parsing tables.
The code is as follows:
package com.xxx.hbuassys.test;

import java.net.URL;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        String sourceUrlString = "test.html";
        // Treat a bare file name as a file: URL
        if (sourceUrlString.indexOf(':') == -1)
            sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested and we need the second (inner) one, so drop the first
        tables.remove(0);
        for (Element table : tables)
        {
            // Every row of the table
            for (Element tr : table.getContent().getAllElements(HTMLElementName.TR))
            {
                int i = 1;
                // Every font element in the row, including nested ones
                for (Element font : tr.getContent().getAllElements(HTMLElementName.FONT))
                {
                    System.out.println(i + " = " + font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}
Output:
1 = 想学习 Name
2 = Result
3 = Time
4 = Synopsis
1 = 9 想学习
2 = +FAIL 想学习
3 = +FAIL
4 = 12:31
5 = 想学习
1 = 1 cdrouter_basic_1
2 = Pass 想学习
3 = 00:00
4 = 想学习
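The general idea is: first get all the table tags, then parse the table you need by taking out its tr elements, and then take the cell contents out of each tr (the code above reads the font elements inside each tr rather than the td elements themselves). That gives us the content we need.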
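A name-based search over descendants reports an element nested inside another element of the same name as a separate match. That is why the second data row above lists "+FAIL 想学习" as entry 2 and "+FAIL" again as entry 3. The sketch below illustrates the effect with a hypothetical `Node` class and a placeholder link text `log`; it is a plain depth-first search written for illustration, not Jericho's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedMatchDemo {
    // Hypothetical stand-in for an HTML element; not a Jericho class.
    static class Node {
        final String name;   // tag name, or "#text" for a text node
        final String text;   // only non-empty for "#text" nodes
        final List<Node> children = new ArrayList<>();

        Node(String name, String text, Node... kids) {
            this.name = name;
            this.text = text;
            for (Node k : kids) children.add(k);
        }

        // Concatenated text of this node and everything below it.
        String allText() {
            StringBuilder sb = new StringBuilder(text);
            for (Node c : children) sb.append(c.allText());
            return sb.toString();
        }

        // Depth-first search: every descendant named `wanted`, in document order.
        void collect(String wanted, List<Node> out) {
            for (Node c : children) {
                if (c.name.equals(wanted)) out.add(c);
                c.collect(wanted, out);
            }
        }
    }

    static Node text(String s) { return new Node("#text", s); }
    static Node el(String name, Node... kids) { return new Node(name, "", kids); }

    // Models a cell shaped like <td><font><font>+FAIL</font> <a>log</a></font></td>
    public static List<String> fontTexts() {
        Node td = el("td",
            el("font",
                el("font", text("+FAIL")),
                text(" "),
                el("a", text("log"))));
        List<Node> fonts = new ArrayList<>();
        td.collect("font", fonts);
        List<String> result = new ArrayList<>();
        for (Node f : fonts) result.add(f.allText());
        return result;
    }

    public static void main(String[] args) {
        // Both the outer font and the nested font are reported.
        System.out.println(fontTexts());
    }
}
```

If only one value per cell is wanted, one option is to iterate over the td elements instead and extract the text of each cell as a whole.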
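If I stopped here, this post would be no different from what others have already written online.
Because of a project requirement, I ran into a problem while using this library:
If the HTML page is in UTF-8 format, the parsed content comes out garbled. Re-encoding the garbled strings afterwards, for example with new String(str.getBytes(),"GBK") or similar operations, does not solve the problem; I have tested this myself.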
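One likely reason such after-the-fact re-encoding cannot work: when bytes are decoded with a charset that does not match them, Java substitutes a replacement character, and the original bytes are gone. The sketch below assumes, purely for illustration, that UTF-8 bytes were wrongly decoded as US-ASCII; the actual mismatch in my project may have been a different pair of charsets, and the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {
    // The three Chinese characters from the sample table, written as Unicode
    // escapes so the encoding of this source file does not matter.
    public static final String ORIGINAL = "\u60f3\u5b66\u4e60";

    // Decode UTF-8 bytes with the wrong charset (assumed mismatch: US-ASCII).
    public static String decodeWrongly() {
        byte[] utf8 = ORIGINAL.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    // The kind of "repair" attempted afterwards: re-encode, then decode again.
    public static String reencode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    // Decoding the original bytes with the matching charset is the lossless path.
    public static String decodeCorrectly() {
        byte[] utf8 = ORIGINAL.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = decodeWrongly();
        String repaired = reencode(garbled, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println("repair matches original: " + repaired.equals(ORIGINAL));
        System.out.println("correct decode matches:  " + decodeCorrectly().equals(ORIGINAL));
    }
}
```

The practical consequence is that the charset has to be right at the moment the bytes are first turned into characters.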
For example, suppose the HTML page is changed to:
<html> <head> <meta http-equiv="content-type" content="text/html;charset=UTF-8"> <title>HTML Parser</title> <meta name="generator" content="Namo WebEditor"> </head> <body> <table width=620 border=0 cellpadding=1 cellspacing=0 bgcolor=#0066cc> <tr> <td width=100%> <table width=100% border=0 cellpadding=4 cellspacing=0 bgcolor=#D3E5FB> <tr bgcolor=#D3E5FB> <td width=20%><font size="2" face="Arial,Verdana"><b>想学习 Name</b></font><br> </td> <td width=13%><font size="2" face="Arial,Verdana"><b>Result</b></font><br> </td> <td width=8%><font size="2" face="Arial,Verdana"><b>Time</b></font><br> </td> <td width=59%><font size="2" face="Arial,Verdana"><b>Synopsis</b></font><br> </td> </tr> <tr bgcolor=#eeeeee> <td width=20%><font size="1" face="Arial,Verdana"><b>9</b> 想学习</font><br> </td> <td width=13%><font size="1" face="Arial,Verdana"><font color=#ff0033>+FAIL</font> <a href="v4_wireless_802.1x_full/cdrouter_dhcp_20.txt">想学习</a></font><br> </td> <td width=8%><font size="1" face="Arial,Verdana">12:31</font><br> </td> <td width=59%><font size="1" face="Arial,Verdana">想学习</font><br> </td> </tr> <tr bgcolor=#ffffff> <td width=20%><font size="1" face="Arial,Verdana"><b>1</b> cdrouter_basic_1</font><br> </td> <td width=13%><font size="1" face="Arial,Verdana">Pass <a href="v4_wireless_802.1x_full/cdrouter_basic_1.txt">想学习</a></font><br> </td> <td width=8%><font size="1" face="Arial,Verdana">00:00</font><br> </td> <td width=59%><font size="1" face="Arial,Verdana">想学习</font><br> </td> </tr> </table> </td> </tr> </table> </body> </html>
Output:
1 = ???? Name
2 = Result
3 = Time
4 = Synopsis
1 = 9 ????
2 = +FAIL ????
3 = +FAIL
4 = 12:31
5 = ????
1 = 1 cdrouter_basic_1
2 = Pass ????
3 = 00:00
4 = ????
The approach I used is to change <meta http-equiv="content-type" content="text/html;charset=UTF-8"> into <meta http-equiv="content-type" content="text/html;charset=GBK"> before parsing.
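An exact-string replacement of the meta tag only matches when the tag is written character for character as shown. If this replacement approach is used on pages that vary in case, spacing, or quoting, a regular expression may be more tolerant. The sketch below is one possible variant; the helper name `declareCharset` and the pattern are assumptions made for illustration and are not part of Jericho.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetRewrite {
    // Matches "charset=<name>", ignoring case and allowing optional spaces and
    // an optional opening quote before the charset name.
    private static final Pattern CHARSET =
        Pattern.compile("(charset\\s*=\\s*[\"']?)[A-Za-z0-9_\\-]+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrite the first declared charset to newCharset.
    public static String declareCharset(String html, String newCharset) {
        Matcher m = CHARSET.matcher(html);
        return m.replaceFirst("$1" + Matcher.quoteReplacement(newCharset));
    }

    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"content-type\" content=\"text/html;charset=UTF-8\">";
        System.out.println(declareCharset(meta, "GBK"));
    }
}
```

Only the first match is rewritten, on the assumption that a page declares its charset once near the top.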
The complete code is as follows:
package com.xxx.hbuassys.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        StringBuilder sbf = new StringBuilder();
        // No charset is passed to InputStreamReader, so the file is decoded
        // with the platform default charset. try-with-resources closes the reader.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("test.html")))))
        {
            String str;
            while ((str = reader.readLine()) != null)
            {
                sbf.append(str).append("\n");
            }
        }
        // Workaround for the garbled Chinese text: rewrite the declared charset
        String html = sbf.toString().replace(
            "<meta http-equiv=\"content-type\" content=\"text/html;charset=UTF-8\">",
            "<meta http-equiv=\"content-type\" content=\"text/html;charset=GBK\">");
        Source source = new Source(html);
        List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested and we need the second (inner) one, so drop the first
        tables.remove(0);
        for (Element table : tables)
        {
            // Every row of the table
            for (Element tr : table.getContent().getAllElements(HTMLElementName.TR))
            {
                int i = 1;
                // Every font element in the row, including nested ones
                for (Element font : tr.getContent().getAllElements(HTMLElementName.FONT))
                {
                    System.out.println(i + " = " + font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}
Output:
1 = 想学习 Name
2 = Result
3 = Time
4 = Synopsis
1 = 9 想学习
2 = +FAIL 想学习
3 = +FAIL
4 = 12:31
5 = 想学习
1 = 1 cdrouter_basic_1
2 = Pass 想学习
3 = 00:00
4 = 想学习
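One thing worth noting about the listing: new InputStreamReader(new FileInputStream(...)) is called without a charset, so the file is decoded with the platform default charset and the result can differ from machine to machine. A possible alternative, sketched below under assumptions, is to decode the file with an explicitly named charset and pass the resulting string to the parser. The `readHtml` helper, the temporary file, and the choice of UTF-8 are made up for this example; the final parsing step appears only as a comment so that the sketch runs without the Jericho jar.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {
    // Hypothetical helper: read the whole file and decode it with the given charset.
    public static String readHtml(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    // Writes a small UTF-8 sample file, reads it back with an explicit charset,
    // and reports whether the Chinese text survived unchanged.
    public static boolean roundTripOk() {
        String cell = "\u60f3\u5b66\u4e60"; // the three characters used in the sample table
        String page = "<td>" + cell + "</td>";
        try {
            Path file = Files.createTempFile("test", ".html");
            try {
                Files.write(file, page.getBytes(StandardCharsets.UTF_8));
                String html = readHtml(file, StandardCharsets.UTF_8);
                // With Jericho on the classpath, the string could then be parsed
                // as in the listing: Source source = new Source(html);
                return html.equals(page);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripOk());
    }
}
```

The charset passed to `readHtml` has to be the one the file was actually saved in, which is not necessarily the one named in its meta tag.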
HTML parser: parsing a table with jericho-html-3.3
Original article: http://blog.csdn.net/xxx823952375/article/details/30969607