
Jsoup

Date: 2017-06-14 10:15:43


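  jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods (source: Baidu Baike). Once the jsoup jar has been downloaded, an example like the following can be run: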

 

package com.gqx.jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {

	public static void main(String[] args) throws IOException {
		// Fetch the page over HTTP and parse it into a Document.
		Document document = Jsoup.connect("http://www.cnblogs.com/helloworldcode/").get();
		// Select every <a> element whose id attribute is Header1_HeaderTitle.
		Elements select = document.select("a[id=Header1_HeaderTitle]");
		for (Element element : select) {
			// Print the text content of each matched element.
			System.out.println(element.text());
		}
	}
}
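
As noted at the start, jsoup can also parse HTML text directly, without a network request. The following is a minimal sketch of that path, not part of the original article: the class name ParseStringExample, the helper selectTexts, and the sample markup are all hypothetical and exist only for illustration. It uses Jsoup.parse(String) together with the same select() call as the Crawler example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseStringExample {

    // Hypothetical sample markup, used only for this illustration.
    static final String SAMPLE_HTML =
            "<html><head><title>Demo page</title></head><body>"
            + "<a id=\"Header1_HeaderTitle\" href=\"/home\">My Blog</a>"
            + "<a class=\"post\" href=\"/p/1\">First post</a>"
            + "<a class=\"post\" href=\"/p/2\">Second post</a>"
            + "</body></html>";

    // Parses the HTML string and returns the text of every element
    // matched by the given CSS selector, in document order.
    static List<String> selectTexts(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        Elements matches = document.select(cssQuery);
        List<String> texts = new ArrayList<>();
        for (Element element : matches) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Attribute selector, as in the Crawler example.
        System.out.println(selectTexts(SAMPLE_HTML, "a[id=Header1_HeaderTitle]"));
        // Class selector, in the jQuery-like style.
        System.out.println(selectTexts(SAMPLE_HTML, "a.post"));
        // The parsed Document also exposes the page title.
        System.out.println(Jsoup.parse(SAMPLE_HTML).title());
    }
}
```

Parsing from a string makes it easy to try out selectors before pointing the code at a live page.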

  The Jsoup.connect() method called in the Crawler example is described in the API documentation as follows:

public static Connection connect(String url)
Creates a new Connection to a URL. Use to fetch and parse a HTML page.
Use examples:

Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();
Parameters:
url - URL to connect to. The protocol must be http or https.
Returns:
the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.
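
The "configure first, then execute" pattern from the description above can be sketched as follows. This is an illustrative sketch rather than code from the original article: the class name ConnectionConfigExample, the helper buildConnection, and all configuration values are hypothetical. It builds a Connection and inspects the pending request without sending anything over the network.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionConfigExample {

    // Builds a configured Connection but does NOT send the request.
    // All values below are placeholder assumptions for illustration.
    static Connection buildConnection(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")               // User-Agent header
                .referrer("http://example.com/")    // Referer header
                .timeout(5000)                      // timeout in milliseconds
                .cookie("auth", "token")            // request cookie
                .data("name", "jsoup")              // request data parameter
                .method(Connection.Method.GET);     // HTTP method
    }

    public static void main(String[] args) {
        Connection connection = buildConnection("http://example.com");

        // The pending request can be inspected before it is executed.
        Connection.Request request = connection.request();
        System.out.println(request.method());
        System.out.println(request.timeout());
        System.out.println(request.header("User-Agent"));
        System.out.println(request.cookie("auth"));

        // Calling connection.get(), connection.post(), or connection.execute()
        // at this point would actually perform the request.
    }
}
```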

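  As shown above, Jsoup.connect(String url) returns a Connection object. Its documentation defines it as follows: "A Connection provides a convenient interface to fetch content from the web, and parse them into Documents." In other words, a Connection gives access to the full content of a web page. The remaining question is how to reach the tag elements and their content from within a class, which means converting all of the page's HTML into a Document object. This is done by calling the get() method.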

Document get()
      throws IOException
Execute the request as a GET, and parse the result.
Returns:
parsed Document
Throws:
MalformedURLException - if the request URL is not a HTTP or HTTPS URL, or is otherwise malformed
HttpStatusException - if the response is not OK and HTTP response errors are not ignored
UnsupportedMimeTypeException - if the response mime type is not supported and those errors are not ignored
SocketTimeoutException - if the connection times out
IOException - on error
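
The exceptions listed above can be handled individually, as in the following sketch. It is not from the original article: the class name SafeFetchExample, the helper fetchTitle, the message strings, and the 5000 ms timeout are hypothetical choices made for illustration. The more specific exceptions are caught before the general IOException, since all of them are subclasses of it.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;

public class SafeFetchExample {

    // Hypothetical helper: returns the page title on success,
    // or a short description of what went wrong.
    static String fetchTitle(String url) {
        try {
            Document document = Jsoup.connect(url).timeout(5000).get();
            return "Title: " + document.title();
        } catch (MalformedURLException e) {
            // The URL is not http/https, or is otherwise malformed.
            return "Malformed URL: " + e.getMessage();
        } catch (HttpStatusException e) {
            // The server answered with a non-OK status code.
            return "HTTP error " + e.getStatusCode() + " for " + e.getUrl();
        } catch (UnsupportedMimeTypeException e) {
            // The response content type is not one jsoup parses.
            return "Unsupported MIME type: " + e.getMimeType();
        } catch (SocketTimeoutException e) {
            // The connection or read timed out.
            return "Timed out: " + e.getMessage();
        } catch (IOException e) {
            // Any other I/O failure.
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://www.cnblogs.com/helloworldcode/"));
    }
}
```

Returning a description keeps the sketch short; real code might instead log the exception or rethrow it, depending on how the caller should react.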

 

 


Original article: http://www.cnblogs.com/helloworldcode/p/7007055.html
